A Survey of Open Peer-to-Peer Technologies and their Applicability to Implementing an Organizational Information Repository

Abstract

Most large organizations lack an effective means to manage internal documentation and other critical information on a large scale. Peer-to-peer technologies have many of the right characteristics to successfully manage such information in a distributed and automated manner. To this end, a general survey of nonproprietary peer-to-peer technologies is undertaken, from the standpoint of their suitability for implementing an Intranet-based information repository.


1.0 Background and Terminology

The term "peer-to-peer" has been used to describe a wide variety of disparate network architectures: networks that connect end users to each other (regardless of internal structure), server-to-server networks (to which end users connect as clients), and even centralized client-server applications disguised to look to users like a peer-to-peer system. In the following discussion, "peer-to-peer" is used in the network architecture sense, i.e. to indicate a distributed system in which data is exchanged between two or more equal, autonomous, general-purpose entities; as opposed to a client-server model, in which roles are rigidly fixed and specialized.

Considerable attention has been given of late to file sharing user applications such as Napster, Gnutella, and Freenet. However, it is of interest to look more broadly at the available technologies, including examining both older and newer peer-to-peer protocols such as NNTP, SMTP, and BEEP; peer-to-peer frameworks such as JXTA and SOAP; and the adaptability of client-server protocols such as HTTP and FTP to peer-to-peer use. (Napster, on the other hand, is in reality a proprietary and centrally controlled client-server technology, even though it looks to end users like a peer-to-peer system, so it is not of interest here.)

1.1 Scope and General Goals

Most corporations and other organizations have a problem managing their own internal information, despite a number of major technological advances. As early as the 1970s, information would sometimes be kept on-line on a large centralized computer, but typically this information was standalone, hard to update, and difficult to access or search.

Throughout the 1980s, internal information was distributed to smaller and smaller systems, making access and maintenance easier but making it very difficult for any one person to assess what information actually existed organization-wide. This second phase culminated in a variety of application-specific document formats, such as word processing documents and spreadsheets, placed on departmental file servers or occasionally in a repository available organization-wide. Different departments frequently used different formats, and specific applications were generally required to read these documents, let alone update them. More costly yet, references to other documents were (and still are) most often included by wholesale incorporation, resulting in a serious maintenance problem as the documents required updating. The lack of an effective inter-document reference system also resulted in substantial duplication of effort.

The coming of the Intranet web server made a more effective system possible by the mid-1990s, but even today web-based information stores have not reached their potential. The creators of internal content generally do not think in terms of web pages, and even when they do, the process for maintaining those pages over time is not often defined. And vast amounts of information remain in standalone word-processing documents on departmental file servers, unsearchable and updated by hand everywhere a piece of data resides.

1.2 Characteristics of a Modern Information Repository

How is a modern organization to more effectively store and manage its internal data? A client-server information repository application could work, and in fact a number of them are available commercially; but the client-server structure favors centralized resources and a circuitous route from information producers to information consumers. In contrast, peer-to-peer structures facilitate distributed systems with simple paths for information to take from producer to consumer.

It is well worth considering the characteristics that will be required of a peer-to-peer information repository:

  • A distributed architecture is important to permit information to be produced and managed "locally", at the department level or even below.

  • The information should be globally searchable. In a distributed architecture, this is most effectively accomplished by having a distributed search capability designed into the system.

  • Mechanisms will be needed to automatically age information -- that is, its timeliness needs to be considered in its life cycle, from creation to updates and eventually to obsolescence. Some types of information should gradually disappear from view as they become less relevant; others need to be persistent and to be updated aggressively as time passes.

  • Publish-and-subscribe (P&S) mechanisms are an effective method for channeling information towards those people who need it, even as the information is being created and released.
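
The aging and publish-and-subscribe requirements above can be illustrated with a minimal sketch in Python. Every name here (Repository, publish, subscribe, max_age) is hypothetical, and the fixed time-to-live rule is only one of many possible aging policies; it is not drawn from any of the technologies surveyed below.

```python
import time
from collections import defaultdict


class Repository:
    """Toy in-memory repository with topic subscriptions and simple aging."""

    def __init__(self):
        self.items = []                       # (topic, text, expires_at or None)
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, text, max_age=None, now=None):
        # max_age=None marks persistent information that never expires.
        now = time.time() if now is None else now
        expires_at = None if max_age is None else now + max_age
        self.items.append((topic, text, expires_at))
        # Push the new item to interested parties as it is released.
        for callback in self.subscribers[topic]:
            callback(topic, text)

    def visible(self, now=None):
        # Expired items drop out of view; persistent ones remain.
        now = time.time() if now is None else now
        return [text for _, text, exp in self.items if exp is None or exp > now]


received = []
repo = Repository()
repo.subscribe("hr/policies", lambda topic, text: received.append(text))
repo.publish("hr/policies", "Travel policy v2", max_age=None, now=0)
repo.publish("facilities", "Lobby closed today", max_age=86400, now=0)
```

In this sketch the subscriber sees the policy document the moment it is published, while the one-day facilities notice disappears from visible() once its lifetime has passed.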

Many other features are likely to be desirable for this application. To list a few:

  • very large scale, high-volume implementation
  • strong authentication of users
  • secure and effective user authorization architecture
  • partitioned/compartmented data regions or categories
  • reliable delivery of data objects
  • robustness, fault-tolerance, and survivability
  • tracking or logging of user actions and object delivery
  • straightforward, efficient implementation
  • efficient use of bandwidth and computing resources
  • flexible data and metadata formats
  • effective handling of a wide range of object types and sizes

2.0 The Technologies

This discussion will concentrate on truly open technologies with source code freely available without major restrictions (though some of the technologies discussed, including our own Beryllium project, are still in development). Furthermore, we will primarily consider implementations that are Internet Protocol (IP)-based, as opposed to proprietary or obsolete protocols such as IPX and NetBEUI. We will look at both native peer-to-peer protocols and peer-to-peer systems built over other protocols (such as HTTP). We will focus on protocol and architecture features that are relevant to establishing an effective infrastructure for applications, rather than on the high-level applications themselves [1].

2.1 Transport Protocols

Several data transport protocols are of interest for peer-to-peer applications. Some of these protocols are fundamentally peer-to-peer in nature, while others are client-server but have been used to build peer-to-peer systems. These protocols are listed roughly in order of age.

2.1.1 File Transfer Protocol (FTP)

FTP is an ancient client-server protocol, dating back to the earliest days of the Arpanet over 30 years ago, and in its earliest forms predating TCP/IP itself by several years. It is widely implemented as a method to access data repositories (often anonymously); for bulk data transfer; and to implement some peer-to-peer file sharing systems. It is efficient for transferring large objects, but carries high per-object and per-session overhead and relies on an antiquated security framework [2].

2.1.2 Network News Transfer Protocol (NNTP)

NNTP is a peer-to-peer (server-to-server) protocol used to implement the Usenet News system -- the "mother of all Publish-and-Subscribe systems". Usenet dates back to 1979 and originally ran over the Unix-to-Unix Copy (UUCP) network. NNTP was introduced in 1986 for operating Usenet News over the Internet, and by the end of the 1980s it was in widespread and high-volume use. Today NNTP has many extremely efficient, high-volume implementations, and is also supported by many web and email clients [3].

Data is organized into hierarchically arranged "newsgroups", which can be explicitly routed and managed. NNTP's redundant routing paths can efficiently create a highly robust and survivable system. Like SMTP (see Section 2.1.3 below), NNTP relies on the Multipurpose Internet Mail Extensions (MIME) to handle complex data formats. Also like SMTP, it cannot efficiently handle binary data or large objects, and lacks a meaningful security framework. NNTP's lack of security enforcement has enabled spammers to steadily erode the usefulness of Usenet News, to the point where it is today far less important than it once was.
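
The survivability that redundant routing paths provide can be sketched as a flood over a graph of servers, in which a server declines any article whose message ID it already holds. This is only an illustration of the idea, not the NNTP wire protocol; the function name, the topology, and the failure model are all assumptions.

```python
from collections import deque


def flood(peers, origin, failed=frozenset()):
    """Return the set of servers an article reaches from `origin`.

    peers  -- dict mapping each server name to the servers it feeds
    failed -- servers that are down and neither accept nor forward
    """
    have = {origin}            # servers already holding the article
    queue = deque([origin])
    while queue:
        server = queue.popleft()
        for neighbor in peers[server]:
            # A server that already holds the message ID declines the offer,
            # so redundant links cost an offer but not a second transfer.
            if neighbor in have or neighbor in failed:
                continue
            have.add(neighbor)
            queue.append(neighbor)
    return have


# Hypothetical four-server topology with two independent paths from A to D.
peers = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
```

With server B marked as failed, flood(peers, "A", failed={"B"}) still reaches D by way of C, which is the property the redundant paths are meant to provide.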

2.1.3 Simple Mail Transfer Protocol (SMTP)

SMTP is a peer-to-peer protocol used as the basis for Internet email -- one of the most productive (and most abused) applications ever created. It is a simple and flexible message delivery mechanism; coupled with MIME it can handle complex objects of many formats, although it cannot efficiently handle binary data or large objects, and lacks a meaningful security framework [4] [5]. Unless strong security enforcement is introduced soon, spamming will become such a serious problem that the usefulness of Internet email will go into decline, as happened to Usenet News several years ago.
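
One reason binary data is handled inefficiently is that MIME typically carries it as base64 text, which expands the payload by roughly one third. The following sketch uses Python's standard email package to show the effect; the addresses, subject, and file name are hypothetical.

```python
from email.message import EmailMessage

payload = bytes(range(256)) * 12          # 3072 bytes of arbitrary binary data

msg = EmailMessage()
msg["From"] = "producer@example.org"      # hypothetical addresses
msg["To"] = "consumer@example.org"
msg["Subject"] = "Quarterly figures"
msg.set_content("Report attached.")
msg.add_attachment(payload, maintype="application",
                   subtype="octet-stream", filename="report.bin")

attachment = next(msg.iter_attachments())
encoded = attachment.get_payload()              # base64 text as carried in the message
decoded = attachment.get_payload(decode=True)   # original bytes recovered on receipt
```

Comparing len(encoded) with len(payload) shows the size penalty paid for passing binary data through a text-oriented transport.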

2.1.4 HyperText Transfer Protocol (HTTP/1.1)

HTTP is a wildly popular client-server protocol, responsible for much of the explosive growth of the Internet throughout the 1990s. Due to its success and adaptability, however, it is now probably overused, as many developers think first "how should I fit this new application into our web server" rather than first considering whether something else would be more suitable. Adapting HTTP to bidirectional, peer-to-peer data transfers tends to result in a complex system that uses two or more TCP sessions for each connection and too often relies on polling rather than data-driven semantics. In addition to the many client-server web applications, HTTP is the basis for many of the current peer-to-peer file sharing systems and many business-to-business and business-to-government data transfer applications [6]. HTTP relies heavily on Secure Sockets Layer (SSL) for its security framework, which can be made relatively secure given considerable diligence.

2.1.5 Internet Cache Protocol (ICP)

ICP is a highly efficient protocol for communications between caching web proxy servers. These servers query each other via ICP in order to quickly determine whether any neighboring proxies have a needed object, and to determine which of them is the most efficient source for the object [7]. ICP was the basis for the Harvest Cache network [8] and for the IRCache [9] project of the National Laboratory for Applied Network Research (NLANR). Today ICP is supported by the widely deployed Squid proxy server [10] and by many proprietary caching proxy servers. ICP implementations typically allow detailed access control lists (ACLs) specifying what particular sets of clients are permitted to access. However, these are usually based primarily on IP addresses or network location, rather than on strongly authenticated credentials.
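
The neighbor-selection step described above can be sketched as follows. This is a simplified simulation rather than the ICP message format defined in [7]; the function name, the proxy names, and the use of round-trip time as the measure of efficiency are assumptions made for illustration.

```python
def choose_source(url, neighbors):
    """Pick the neighbor to fetch `url` from, or None to go to the origin.

    neighbors -- list of (name, round_trip_ms, cached_urls) tuples
    """
    # Every neighbor is asked; only those answering with a hit are candidates.
    hits = [(rtt, name) for name, rtt, cached in neighbors if url in cached]
    if not hits:
        return None                      # all misses: fetch from the origin server
    return min(hits)[1]                  # fastest responder among the hits


neighbors = [
    ("proxy-east", 40, {"http://example.org/a"}),
    ("proxy-west", 15, {"http://example.org/a", "http://example.org/b"}),
    ("proxy-hq", 5, set()),
]
```

Here proxy-hq responds fastest but does not hold the object, so a request for http://example.org/a is directed to proxy-west, the quicker of the two neighbors that do.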

2.1.6 BEEP

The Blocks Extensible Exchange Protocol (BEEP, formerly known as BXXP) core is a protocol framework designed to be adaptable to a wide range of applications, most notably very efficient peer-to-peer data transfer. It permits the use of many separate bidirectional "channels" of data over a single TCP session [11] [12]. The BEEP Core has been implemented in several languages, and used in several applications, since the beginnings of its development in late 2000 [13].
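
The idea of carrying several channels over one session can be sketched with a deliberately simplified framing scheme. The header used here (a channel number and a payload size on one line) is hypothetical and is not the frame syntax defined in the BEEP core specification [11]; the sketch shows only how interleaved frames can be separated again by channel.

```python
def mux(frames):
    """Serialize (channel, payload) pairs into one byte stream."""
    out = bytearray()
    for channel, payload in frames:
        out += f"{channel} {len(payload)}\n".encode("ascii") + payload
    return bytes(out)


def demux(stream):
    """Split a byte stream back into per-channel lists of payloads."""
    channels = {}
    pos = 0
    while pos < len(stream):
        end = stream.index(b"\n", pos)          # end of the header line
        channel, size = (int(x) for x in stream[pos:end].split())
        start = end + 1
        channels.setdefault(channel, []).append(stream[start:start + size])
        pos = start + size                      # jump to the next frame header
    return channels


# Two logical conversations interleaved on what would be a single TCP session.
stream = mux([(1, b"search: budget"), (2, b"chunk-0"),
              (1, b"search: travel"), (2, b"chunk-1")])
```

Because each frame states its own length, a large transfer on one channel need not block a short exchange on another, provided the sender interleaves its frames.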

BEEP's ability to get the most out of a single TCP session facilitates the use of session encryption and strong authentication, though security policies are implemented in associated applications rather than in the BEEP core itself.

See also the discussion of Beryl, a secure peer-to-peer message transfer application protocol based on BEEP, in Section 2.3.4 below.

2.2 Peer-to-Peer Applications

2.2.1 Gnutella

Gnutella is probably the most widely deployed and used true peer-to-peer file sharing application today. It is a peer-to-peer and distributed system in nearly every sense. It was created (in just two weeks) in March 2000 by Justin Frankel and Tom Pepper of Nullsoft with the goal of creating a system to share recipes. Gnutella was soon orphaned due to AOL's acquisition of Nullsoft and lack of interest in the project; but many others took up the challenge, creating several compatible and improved implementations. Gnutella itself is primarily a distributed monitoring and searching system; it relies on HTTP to accomplish the actual file downloads [14].

Gnutella in its original form has a number of major weaknesses, the most notorious of which is its great hunger for network bandwidth. This problem is being solved through specialized cache and summarization nodes and through adjustments to the protocol itself. Limitations will likely remain due to the lack of a security infrastructure or a structured or standardized search capability; but Gnutella's inherent simplicity and great popularity ensure that it will find its way into many more unforeseen applications down the road [15].
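
The bandwidth problem follows from the way the original protocol floods each search to every neighbor until a time-to-live (TTL) counter expires. A rough upper bound on the resulting traffic can be computed as below, assuming an idealized tree-shaped overlay in which no node is reached twice. The function name and the example values of four connections and a TTL of seven are assumptions chosen for illustration.

```python
def query_messages(connections, ttl):
    """Upper bound on query messages generated by one flooded search.

    Assumes an idealized tree-shaped overlay: every node has `connections`
    neighbors and forwards the query to all of them except the sender.
    """
    total = 0
    fanout = connections               # the originator sends to all neighbors
    for _ in range(ttl):
        total += fanout
        fanout *= connections - 1      # each recipient forwards to the rest
    return total
```

Under these assumptions, query_messages(4, 7) gives 4372 messages for a single search, and the count grows geometrically with either parameter.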

2.2.2 Mojo Nation

To the basic idea of a distributed file-sharing system, Mojo Nation adds a novel micropayment accounting system which grants "Mojo" to those providing resources to the network, and lets them spend it by uploading and downloading files (i.e. "using" resources). Payments of Mojo are overseen by a trusted third party (run by the developer, who also buys and sells Mojo in real-world currency).

Mojo Nation uses a local "broker" program to interface with the user's web client, and "relay servers" to enable users behind firewalls or with dynamic IP addresses to connect to the network.

Mojo Nation files are normally accompanied by searchable and structured metadata, though nameless, descriptionless documents are possible as well. Nodes do not store complete documents, but rather "shares" of the documents, spread across participating systems in a redundant, failsafe manner. In order to improve performance, these shares can be downloaded from many systems at once ("swarm distribution"). Like Gnutella, Mojo Nation does not directly address security in any way.
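
The principle of storing redundant shares rather than whole documents can be illustrated with a single XOR parity block. This is a deliberately simple stand-in and is not Mojo Nation's actual coding scheme; the function names are hypothetical, and the sketch tolerates the loss of only one share.

```python
def make_shares(data, k):
    """Split `data` into k equal blocks plus one XOR parity block."""
    size = -(-len(data) // k)                       # ceiling division
    padded = data.ljust(size * k, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)
    for block in blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return blocks + [parity]


def recover(shares, length):
    """Rebuild the data from shares in which at most one entry is None."""
    shares = list(shares)
    if None in shares:
        missing = shares.index(None)
        size = len(next(s for s in shares if s is not None))
        rebuilt = bytes(size)
        # XOR of all surviving shares reproduces the missing one.
        for s in shares:
            if s is not None:
                rebuilt = bytes(a ^ b for a, b in zip(rebuilt, s))
        shares[missing] = rebuilt
    return b"".join(shares[:-1])[:length]


data = b"organizational repository"
shares = make_shares(data, 4)
```

Each share could be held by a different node and fetched in parallel; if any one node is unreachable, recover() still reproduces the original document from the remaining four shares.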

Mojo Nation had 2000 active beta testers as of the autumn of 2000. As of this writing, it is rapidly approaching stability. A number of features are planned in the future to address payments ("tips") to authors of popular material [16].

2.2.3 Freenet

Freenet's goals also include providing a distributed file storage and retrieval system, but with serious consideration for the privacy and anonymity of those introducing and retrieving data from the system; as well as survivability in the face of large-scale, active attacks; resistance to censorship; and aggressive caching of data in order to reduce bandwidth consumption and improve performance. Freenet was originally designed by Ian Clarke in 1999. With the help of several volunteers, its development has been quite rapid since then [17] [18].

Freenet does not truly "store" files; it merely caches them. In current versions, if a file is not requested it disappears after a time period determined by the cache aging policies of individual nodes. Freenet is also currently completely unsearchable; even servers holding a document have no way of finding out what it contains. Heavy use of encryption and indirection results in a system with fairly high overhead when handling many small files.
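
The effect of cache aging can be sketched with a small fixed-capacity cache that discards whichever entry was requested least recently. The least-recently-used rule and the class name are assumptions; as noted above, the actual aging policy is left to individual nodes.

```python
from collections import OrderedDict


class NodeCache:
    """Fixed-capacity store that evicts the least recently requested key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def insert(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        while len(self.data) > self.capacity:
            self.data.popitem(last=False)      # drop the stalest entry

    def request(self, key):
        if key not in self.data:
            return None                        # aged out, or never stored here
        self.data.move_to_end(key)             # a request keeps the file alive
        return self.data[key]


cache = NodeCache(capacity=2)
cache.insert("k1", b"popular")
cache.insert("k2", b"ignored")
cache.request("k1")
cache.insert("k3", b"new")
```

After these steps the unrequested entry k2 has been evicted while k1 survives, which mirrors the way unpopular files fade from the network.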

2.2.4 Experimental Peer-to-Peer Applications

There are numerous additional peer-to-peer applications, which for the most part are not mature enough to be used as the basis for production applications. However, they may demonstrate useful approaches to solving difficult technical problems which will be encountered in implementing a distributed information repository.

2.3 Peer-to-Peer Application Frameworks

2.3.1 XML-RPC

XML-RPC (eXtensible Markup Language-Remote Procedure Call) is a specification governing loosely-coupled, operating-system-independent calls to other systems. It uses HTTP as its transport protocol and, as the name suggests, XML for its encoding [19]. It was developed primarily by Dave Winer of Userland Software beginning in early 1998.

XML-RPC is similar to but not compatible with SOAP's RPC implementation; in particular it uses ordered parameters (while SOAP 1.1 RPC uses named parameters). To add further confusion to the matter, XML-RPC was originally called "SOAP" early in its development, but was renamed in April 1998. At this point XML-RPC is widely supported and implemented, but is being superseded by SOAP 1.1 implementations, which offer a superset of its functionality [20] [21].
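
The ordered-parameter convention can be seen by encoding a call with the XML-RPC module in Python's standard library. The method name and arguments are hypothetical; only the encoding step is shown, with no server or HTTP exchange involved.

```python
import xmlrpc.client

# Parameters are passed as an ordered tuple; the encoded call carries no names.
request = xmlrpc.client.dumps(("Q3-budget.doc", 2),
                              methodname="repository.fetch")

# Decoding returns the same positional tuple together with the method name.
params, method = xmlrpc.client.loads(request)
```

The receiver must therefore know the meaning of each argument from its position alone, which is the main point of difference from the named parameters of SOAP 1.1 RPC.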

2.3.2 SOAP 1.1

The Simple Object Access Protocol (SOAP) specifies a methodology for exchanging structured XML data between peers, interpreting this data, and carrying out SOAP remote procedure calls (RPCs) [22]. It has strong industry backing, including Microsoft, IBM, and Sun; and is being further refined by a World Wide Web Consortium (W3C) working group. SOAP is in theory protocol-neutral, but the vast majority of the considerable attention focused on SOAP of late has been directed at running it over HTTP. It has also been implemented over BEEP to good effect [23]. IBM's SOAP implementation in Java (since adopted by The Apache Group) can run over SMTP as well as HTTP [24].

SOAP grew out of XML-RPC work in early 1998. The current version, SOAP version 1.1, was introduced in May of 2000 [25]. To date dozens of implementations of SOAP have already been developed, but Microsoft's implementations are the most widespread. Interoperability between Microsoft and non-Microsoft SOAP implementations has not yet been thoroughly demonstrated.

2.3.3 JXTA

JXTA (not an acronym, but based on the word "juxtapose") is a Sun-led peer-to-peer application framework, consisting thus far of seven draft protocol specifications and a prototype implemented in Java [26] [27]. Projects are underway to create alternate implementations or at least APIs in several languages other than Java.

JXTA is specified for operation over HTTP, TCP itself, and directly over physical transports including Bluetooth. In principle it can be layered on other transport protocols as well. A BEEP transport is being actively worked on, and even seems to be the preferred transport for some JXTA projects, but it is not discussed in the main JXTA specifications.

Security is optional but well planned-for in JXTA. A number of security-oriented services are provided in the JXTA core. Several sample or demo applications have been created using JXTA, including distributed searching, file sharing, and information repositories. Much development remains to be done in many areas, but work on the project is quite active both inside and outside Sun (outside contributors include a number of independent P2P vendors).

2.3.4 Beryl and Beryllium

Beryl is a secure, peer-to-peer message transfer application being developed by Adeptech Systems, Inc. (ASI) using the BEEP core and framework, and SGI's File Alteration Monitor (FAM). Beryl can be used to manage and track data flows within a single system, as well as to carry out efficient system-to-system message transfers [28].

Beryllium is a broader application framework comprising the Beryl application, plus a set of management tools and protocol-transparent APIs being developed by ASI (for use with Beryl and also with other common message transfer protocols). Some of the goals of the Beryllium project are similar to those of the JXTA project, but its actual implementation will be simpler and more efficient, with a narrower focus on the transfer and tracking of messages and streaming data.


3.0 References

[1] M. Rose, IETF RFC3117 (Informational), "On the Design of Application Protocols", November 2001

[2] J. Postel, J. Reynolds, IETF STD0009 (RFC0959), "File Transfer Protocol", October 1985

[3] B. Kantor, P. Lapsley, IETF RFC0977 (Proposed Standard), "Network News Transfer Protocol", February 1, 1986

[4] J. Klensin (Editor), IETF RFC2821 (Proposed Standard, replacing STD0010/RFC0821) "Simple Mail Transfer Protocol", April 2001

[5] P. Resnick (Editor), IETF RFC2822 (Proposed Standard, replacing STD0011/RFC0822), "Internet Message Format", April 2001

[6] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, IETF RFC2616 (Draft Standard), "Hypertext Transfer Protocol -- HTTP/1.1", June 1999

[7] D. Wessels, K. Claffy, IETF RFC2186 (Informational), "Internet Cache Protocol (ICP), version 2", September 1997

[8] Harvest Information Discovery and Access System

[9] IRCache Project Home Page

[10] Squid Web Proxy Cache

[11] M. Rose, IETF RFC3080 (Proposed Standard), "The Blocks Extensible Exchange Protocol Core", March 2001

[12] M. Rose, IETF RFC3081 (Proposed Standard), "Mapping the BEEP Core onto TCP", March 2001

[13] Beepcore Project Home Page

[14] Gnutella Project Home Page

[15] Gene Kan, "Peer-to-Peer: Harnessing the Power of Disruptive Technologies", March 2001, pp. 94-122

[16] Mojo Nation Project Home Page

[17] Freenet Project Home Page

[18] Adam Langley, "Peer-to-Peer: Harnessing the Power of Disruptive Technologies", March 2001, pp. 123-132

[19] XML-RPC Specification, June 15, 1999

[20] XML-RPC Home Page

[21] Dave Winer's SOAP News, December 7, 2001

[22] W3C SOAP 1.1 Specification, May 8, 2000

[23] Eamon O'Tuathai and Marshall Rose, Using SOAP in BEEP, October 8, 2001

[24] Bob DuCharme, A general-purpose Java SOAP client, May 2001

[25] Dave Winer's SOAP News, December 7, 2001

[26] JXTA Project Home Page

[27] JXTA v1.0 Protocols Specification, revision 1.2.13, December 17, 2001

[28] Rama Kant, Beryllium: A Secure Message Interconnectivity Solution, October 2001