Draft SpecificationDraft 3 20040907 mcl The GOSSiP architecture The fundamental unit of the GOSSiP architecture is the GOSSiP node. A GOSSiP node: * stores information about email sender identities * stores information about other GOSSiP nodes * adjusts this information according to direct experience and feedback * provides a reputation score when supplied with an identity * provides a confidence rating for each reputation score computed * accepts feedback on messages recently submitted by an identity A GOSSiP node deployment connects the GOSSiP node to three types of entities: * one or more Mail Transport Agents (MTAs) * zero or more other GOSSiP nodes (peers) * one or more feedback mechanisms A GOSSiP node accepts three forms of input: * a query from an MTA or peer * a query response from a peer * feedback from a feedback mechanism Structure of a GOSSiP Query A GOSSiP query consists of four parts: 1) A tag identifying the input as a query 2) The identity queried 3) A Time-To-Live (TTL) value 4) An Unique Message-Identification String (UMIS) Identities Queried identities consist of two parts: 1) The right-hand-side (RHS) of the RFC2821 MAIL FROM: address supplied by the sender (i.e., all characters to the right of the "@" symbol) 2) The IP address of the sender, as determined by the mail server when the sender connects, or some tag as determined by the MTA agent, e.g., indicating that the address has passed an authentication test. TTLs The TTL of a query constrains the depth to which a query will be propagated in the GOSSiP network. When a mailserver initiates a query, the TTL is set according to a value specified by the administrator at deployment time. The TTL will be decremented as the query is propagated through the network of peers. When the TTL reaches 0, the propagation will stop. Unique Message-Identiffication String (UMIS) The UMIS of a query is a unique string generated by the GOSSiP MTA agent and inserted into an incoming message as a GOSSiP-ID: header prior to spooling, if the MTA agent instructs the MTA to continue processing the message based on the response from the GOSSiP node. How Queries Are Received A GOSSiP node receives a query on a predetermined high-numbered port via TCP. When a connection is attempted, the GOSSiP node compares the source IP to a list of peers and MTAs. If the source IP matches one of the addresses on that list, the connection is allowed to proceed. Otherwise, the connection is reset. The connection between the MTA agent and the GOSSiP node may or may not be encrypted via SSL. The connections among peers are always SSL-encrypted. How Queries Are Processed Once a query connection has been accepted, the GOSSiP node accepts and parses the query. The connection is held open until the node supplies a response to the sender. The GOSSiP node checks the TTL value. If the TTL is equal to 0, the GOSSiP node responds with the reputation information and confidence value for that identity. Otherwise, the GOSSiP node decrements the TTL by 1 and passes along the query to each of its peers. The GOSSiP node waits for a response from each of its peers, or until a timeout occurs, whichever comes first. Once one of these two conditions have been satisfied, the GOSSiP node computes a reputation score and confidence value for the queried identity and responds to the original sender with this information. The original sender should acknowledge receipt of this information and close the TCP connection. If the original sender does not receive a response before a timeout occurs, the original sender may close the connection. Reputation Score Reputation is defined as a value indicative of an entity's behavior over time, as evaluated by one or more GOSSiP nodes. A GOSSiP node tracks reputation for: * sender identities * peers The reputation score is represented by the function: ((P(h) - P(s))k) e Reputation(i) = 200 * ( ------------------------- - 0.5 ) ((P(h) - P(s))k) 1 + e Which is essentially the Rasch logistic model. P(h) is the ratio of ham accounted for within N to the size of N. P(s) is the ratio of spam accounted for within N to the size of N. k is a constant which effectively adjusts the sensitivity of the reputation model to small changes in P(h)-P(s). Recommended values for k are in the range of 2-10. As k->0, Reputation(i) becomes more linear. i is a given identity. The Reputation Cache The GOSSiP node contains a FIFO cache of the last N local queries. Each entry in this cache is either 0 or 1, with 0 indicating spam and 1 indicating nonspam (a.k.a. "ham"). The value for N is set at deployment time, and should be sufficiently large to ensure a representative sample of data. The values in this reputation cache are obtained through the feedback mechanism. When an entry is made in the reputation cache, the seconds since epoch at the time of entry is recorded. This value is a single variable, and is updated every time a reputation cache entry is made. Each reputation cache has such a "time last seen" variable associated with it. This value is used in the computation of the confidence value for a reputation score. Confidence Value The confidence score is represented by the function: Confidence(R(i)) = [ sum(o) / N ] * age R(i) is the reputation score for a given identity i. sum(o) is the total number of observations stored in the reputation cache. N is the size of the reputation cache. age is represented by the following function: age(o) age = 100 * ( ---------- ) age(n) age(n) = seconds since epoch at time of computation. age(o) = seconds since epoch at time of last local transaction with identity. This ensures that the confidence value reflects not only the amount of data the node has for a given identity, but how old that data is. How Reputation is Computed When one of two conditions have been met: 1) The GOSSiP node has received a query with a TTL of 0 2) Zero or more of the GOSSiP node's peers have timed out while the node was waiting for a query response from them, and the remaining peers have supplied a query response the GOSSiP node checks its cache of recently-seen UMISes. If the UMIS in the query is not one the node has recently seen, it will compute a reputation score and confidence value for the identity, and respond to the source of the query with this data. The computation of reputation proceeds as follows: 1) The node computes a reputation score based on its own data 2) The node computes a confidence value based on its own data 3) The node adjusts reported reputation scores according to the associated confidence values 4) The node adjusts these adjusted reputation scores according to the reporting peer's reputation 5) The node aggregates its own reputation and confidence scores with those received from its peers 6) The node returns these aggregate scores to the original source of the query The reputation score in step 1 is computed according to the formula in the section titled, "Reputation Score". At this time, peer reputations are also updated (see the section titled, "Peer Reputation" for details). The confidence value in step 2 is computed according to the formula in the section titled, "Confidence Value". Once the node's own reputation score and confidence value have been computed, the node adjusts all reputation scores by their associated confidence values. The confidence value is in the range [0..100], and may be considered a percentage. Thus, the confidence adjustment for a reputation score R(i) for a given identity i is: C(R(i) aR(i) = R(i) * -------- 100 The node adjusts these new reputation scores according to how much the node trusts each peer (R(p)): If R(p) >= 0, aR(i) is unchanged. If R(p) < 0, aR(i) is decreased according to the following formula: R(p) TR(i) = aR(i) * abs( -------- ) 100 The node then computes the mean of all adjusted reputation scores for the current query (i.e., the mean of each reputation response from its peers and its own reputation score), and the standard deviation of each reputation score. Any reputation score whose standard deviation is greater than 3 is eliminated, and the mean of the remaining reputation scores is computed. The mean of the confidence scores associated with the remaining reputation scores is also computed. The node returns the averaged reputation score and the averaged confidence value to the original source of the query. Peer Reputation Each GOSSiP node computes the reputation of its peers, and uses this information to modulate responses received from each peer. The node keeps a FIFO cache of size M of interactions with each peer. The values stored in this cache are -1, 0, or 1. 0 indicates no data. 1 indicates a query response with which the node agreed. -1 indicates a query response with which the node disagreed. The peer reputation score is computed exactly as described in the section entitled, "Reputation Score". M is significantly smaller than the size of identity reputation caches, however. A value of 100 is reasonable, though a smaller value would certainly work as well. Agreement is determined at the time a node computes its own reputation score, prior to score aggregation and reporting. As with the reputation cache, the peer reputation cache is hashed such that each value is indexed by the UMIS to which it refers. The effect of peer reputation is meant to mediate the input of those peers that have repeatedly demonstrated an opinion that differs with that of the node itself. Feedback Mechanism A GOSSiP node may receive feedback on a high-numbered TCP port. The feedback is composed of two items: 1) The UMIS associated with a specific message, and an integer value indicating that the message was evaluated as spam (0) or ham (1). The GOSSiP node will accept this information and close the connection with the feedback mechanism. The UMIS will be compared with a FIFO cache of size P of recently seen UMIS strings, where P is sufficiently large to reflect the previous Y minutes of traffic, providing a limited window in which feedback may be submitted for a given message. If the UMIS is found in this cache, it is removed from the cache, and the evaluation is added to the identity's data. Relay-Only Nodes The GOSSiP architecture allows for the deployment of "relay-only" nodes -- nodes that would not track any data of their own, but act instead as agents that pass along queries among a set of disparate nodes. By using a wildcard in the Replied-To cache, the comparison will always succeed, forcing the node to exclude itself from participating in any reputation/confidence computation. The reason for wanting such relay-only nodes lay in the "small world" phenomenon, which relies on both strong and weak social associations. GOSSiP's immediate peer structure provides adequate stong association linkage, but the architecture lacks a good weak association mechanism. By deploying relay-only nodes that join otherwise disparate clusters of peers, the necessary weak associations are introduced. |