draft-spec

News

FAQ

Documentation

Mailing List

Downloads

Links
Draft Specification
Draft 3
20040907
mcl

The GOSSiP architecture

The fundamental unit of the GOSSiP architecture is the GOSSiP node.  A
GOSSiP node:

 * stores information about email sender identities
 * stores information about other GOSSiP nodes
 * adjusts this information according to direct experience and feedback
 * provides a reputation score when supplied with an identity
 * provides a confidence rating for each reputation score computed
 * accepts feedback on messages recently submitted by an identity



A GOSSiP node deployment connects the GOSSiP node to three types of
entities:

 * one or more Mail Transport Agents (MTAs)
 * zero or more other GOSSiP nodes (peers)
 * one or more feedback mechanisms



A GOSSiP node accepts three forms of input:

 * a query from an MTA or peer
 * a query response from a peer
 * feedback from a feedback mechanism

Structure of a GOSSiP Query
 A GOSSiP query consists of four parts:
   1) A tag identifying the input as a query 
   2)  The identity queried
   3)  A Time-To-Live (TTL) value
   4)  An Unique Message-Identification String (UMIS)



Identities
 Queried identities consist of two parts:
   1) The right-hand-side (RHS) of the RFC2821 MAIL FROM: address
      supplied by the sender (i.e., all characters to the right of
      the "@" symbol)
   2) The IP address of the sender, as determined by the mail server
      when the sender connects, or some tag as determined by the
      MTA agent, e.g., indicating that the address has passed an
      authentication test.



TTLs
 The TTL of a query constrains the depth to which a query will be 
propagated in the GOSSiP network.  When a mailserver initiates a query, 
the TTL is set according to a value specified by the administrator at 
deployment time.  The TTL will be decremented as the query is propagated
through the network of peers.  When the TTL reaches 0, the propagation
will stop.



Unique Message-Identiffication String (UMIS)
 The UMIS of a query is a unique string generated by the GOSSiP MTA
agent and inserted into an incoming message as a GOSSiP-ID: header 
prior to spooling, if the MTA agent instructs the MTA to continue
processing the message based on the response from the GOSSiP node.




How Queries Are Received
  A GOSSiP node receives a query on a predetermined high-numbered port 
via TCP.  When a connection is attempted, the GOSSiP node compares the
source IP to a list of peers and MTAs.  If the source IP matches one of
the addresses on that list, the connection is allowed to proceed.  
Otherwise, the connection is reset.  The connection between the MTA
agent and the GOSSiP node may or may not be encrypted via SSL.  The
connections among peers are always SSL-encrypted.



How Queries Are Processed
  Once a query connection has been accepted, the GOSSiP node accepts
and parses the query.  The connection is held open until
the node supplies a response to the sender.  The GOSSiP node checks
the TTL value.  If the TTL is equal to 0, the GOSSiP node responds 
with the reputation information and confidence value for that 
identity.  Otherwise, the GOSSiP node decrements the TTL by 1 and
passes along the query to each of its peers.  The GOSSiP node waits
for a response from each of its peers, or until a timeout occurs,
whichever comes first.  Once one of these two conditions have been
satisfied, the GOSSiP node computes a reputation score and confidence
value for the queried identity and responds to the original sender
with this information.  The original sender should acknowledge receipt
of this information and close the TCP connection.  If the original
sender does not receive a response before a timeout occurs, the
original sender may close the connection.



Reputation Score
 Reputation is defined as a value indicative of an entity's behavior
over time, as evaluated by one or more GOSSiP nodes.

A GOSSiP node tracks reputation for:
 * sender identities
 * peers

The reputation score is represented by the function:
                              ((P(h) - P(s))k)
                             e
Reputation(i) = 200 * ( ------------------------- - 0.5 )
                              ((P(h) - P(s))k)
                         1 + e

Which is essentially the Rasch logistic model.  
P(h) is the ratio of ham accounted for within N to the size of N.
P(s) is the ratio of spam accounted for within N to the size of N.
k is a constant which effectively adjusts the sensitivity of the 
  reputation model to small changes in P(h)-P(s).  Recommended
  values for k are in the range of 2-10.  As k->0, Reputation(i)
  becomes more linear.
i is a given identity.



The Reputation Cache
 The GOSSiP node contains a FIFO cache of the last N local queries.
Each entry in this cache is either 0 or 1, with 0 indicating spam and 1 
indicating nonspam (a.k.a. "ham").  The value for N is set at deployment 
time, and should be sufficiently large to ensure a representative sample 
of data.

The values in this reputation cache are obtained through the feedback
mechanism.

When an entry is made in the reputation cache, the seconds since epoch
at the time of entry is recorded.  This value is a single variable, and
is updated every time a reputation cache entry is made.  Each reputation
cache has such a "time last seen" variable associated with it.  This
value is used in the computation of the confidence value for a
reputation score.



Confidence Value
 The confidence score is represented by the function:

Confidence(R(i)) = [ sum(o) / N ] * age

R(i) is the reputation score for a given identity i.
sum(o) is the total number of observations stored in the reputation cache.
N is the size of the reputation cache.

age is represented by the following function:
                 age(o)
age = 100 * (  ---------- )
                 age(n)

age(n) = seconds since epoch at time of computation.
age(o) = seconds since epoch at time of last local transaction with identity.

This ensures that the confidence value reflects not only the amount
of data the node has for a given identity, but how old that data is.


How Reputation is Computed
  When one of two conditions have been met:
    1) The GOSSiP node has received a query with a TTL of 0
    2) Zero or more of the GOSSiP node's peers have timed out while the
       node was waiting for a query response from them, and the remaining
       peers have supplied a query response

the GOSSiP node checks its cache of recently-seen UMISes.  If the UMIS
in the query is not one the node has recently seen, it will compute a
reputation score and confidence value for the identity, and respond to
the source of the query with this data.


The computation of reputation proceeds as follows:
  1) The node computes a reputation score based on its own data
  2) The node computes a confidence value based on its own data
  3) The node adjusts reported reputation scores according to the
     associated confidence values
  4) The node adjusts these adjusted reputation scores according to
     the reporting peer's reputation
  5) The node aggregates its own reputation and confidence scores with
     those received from its peers
  6) The node returns these aggregate scores to the original source of
     the query

The reputation score in step 1 is computed according to the formula in 
the section titled, "Reputation Score".  At this time, peer
reputations are also updated (see the section titled, "Peer Reputation"
for details).

The confidence value in step 2 is computed according to the formula in
the section titled, "Confidence Value".

Once the node's own reputation score and confidence value have been
computed, the node adjusts all reputation scores by their associated
confidence values.  The confidence value is in the range [0..100], and
may be considered a percentage.  Thus, the confidence adjustment for
a reputation score R(i) for a given identity i is:

                C(R(i)
aR(i) = R(i) * --------
                 100

The node adjusts these new reputation scores according to how much
the node trusts each peer (R(p)):

If R(p) >= 0, aR(i) is unchanged.
If R(p) < 0, aR(i) is decreased according to the following formula:

                       R(p)
TR(i) = aR(i) * abs( -------- )
                       100

The node then computes the mean of all adjusted reputation scores for
the current query (i.e., the mean of each reputation response from its
peers and its own reputation score), and the standard deviation of
each reputation score.  Any reputation score whose standard deviation
is greater than 3 is eliminated, and the mean of the remaining reputation
scores is computed.

The mean of the confidence scores associated with the remaining reputation
scores is also computed.

The node returns the averaged reputation score and the averaged confidence
value to the original source of the query.




Peer Reputation
 Each GOSSiP node computes the reputation of its peers, and uses this
information to modulate responses received from each peer.  The node
keeps a FIFO cache of size M of interactions with each peer.  The
values stored in this cache are -1, 0, or 1.  0 indicates no data.
1 indicates a query response with which the node agreed.  -1 indicates
a query response with which the node disagreed.  The peer reputation
score is computed exactly as described in the section entitled, "Reputation
Score".  M is significantly smaller than the size of identity reputation
caches, however.  A value of 100 is reasonable, though a smaller value
would certainly work as well.

Agreement is determined at the time a node computes its own reputation
score, prior to score aggregation and reporting.

As with the reputation cache, the peer reputation cache is hashed such that
each value is indexed by the UMIS to which it refers.

The effect of peer reputation is meant to mediate the input of those peers
that have repeatedly demonstrated an opinion that differs with that of the
node itself. 



Feedback Mechanism
 A GOSSiP node may receive feedback on a high-numbered TCP port.
The feedback is composed of two items:
  1) The UMIS associated with a specific message, and an integer
value indicating that the message was evaluated as spam (0) or ham (1).

The GOSSiP node will accept this information and close the connection with 
the feedback mechanism.  The UMIS will be compared with a FIFO cache of size
P of recently seen UMIS strings, where P is sufficiently large to reflect
the previous Y minutes of traffic, providing a limited window in which
feedback may be submitted for a given message.  If the UMIS is found in
this cache, it is removed from the cache, and the evaluation is added to
the identity's data.



Relay-Only Nodes
 The GOSSiP architecture allows for the deployment of "relay-only" nodes --
nodes that would not track any data of their own, but act instead as
agents that pass along queries among a set of disparate nodes.  By
using a wildcard in the Replied-To cache, the comparison will always
succeed, forcing the node to exclude itself from participating in any
reputation/confidence computation.

 The reason for wanting such relay-only nodes lay in the "small world"
phenomenon, which relies on both strong and weak social associations.
GOSSiP's immediate peer structure provides adequate stong association
linkage, but the architecture lacks a good weak association mechanism.
By deploying relay-only nodes that join otherwise disparate clusters
of peers, the necessary weak associations are introduced.