Serialization

Serialization is the process of converting a set of object instances that contain references to each other into a linear stream of bytes, which can then be sent through a socket, stored to a file, or simply manipulated as a stream of data. Serialization is the mechanism used by RMI to pass objects between JVMs, either as arguments in a method invocation from a client to a server or as return values from a method invocation. In the first section of this book, I referred to this process several times but delayed a detailed discussion until now. In this chapter, we drill down on the serialization mechanism; by the end of it, you will understand exactly how serialization works and how to use it efficiently within your applications.

The Need for Serialization

Envision the banking application while a client is executing a withdrawal. The part of the application we're looking at has the runtime structure shown in Figure 10-1.
Figure 10-1. Runtime structure when making a withdrawal
 

What does it mean for the client to pass an instance of Money to the server? At a minimum, it means that the server is able to call public methods on the instance of Money. One way to do this would be to implicitly make Money into a server as well.[1] For example, imagine that the client sends the following two pieces of information whenever it passes an instance as an argument:
  • The type of the instance; in this case, Money.
  • A unique identifier for the object (i.e., a logical reference). For example, the address of the instance in memory.





The RMI runtime layer in the server can use this information to construct a stub for the instance of Money, so that whenever the Account server calls a method on what it thinks of as the instance of Money, the method call is relayed over the wire, as shown in Figure 10-2.
Figure 10-2. Relaying a Money method call from the server
 

Attempting to do things this way has three significant drawbacks:
  • You can't access fields on the objects that have been passed as arguments. Stubs work by implementing an interface. They implement the methods in the interface by simply relaying the method invocation across the network. That is, the stub methods take all their arguments and simply marshall them for transport across the wire. Accessing a public field is really just dereferencing a pointer--there is no method invocation and hence, there isn't a method call to forward over the wire.
  • It can result in unacceptable performance due to network latency. Even in our simple case, the instance of Account is going to need to call getCents( ) on the instance of Money. This means that a simple call to makeWithdrawal( ) really involves at least two distinct networked method calls: makeWithdrawal( ) from the client and getCents( ) from the server.
  • It makes the application much more vulnerable to partial failure. Let's say that the server is busy and doesn't get around to handling the request for 30 seconds. If the client crashes in the interim, or if the network goes down, the server cannot process the request at all. Until all data has been requested and sent, the application is particularly vulnerable to partial failures.

This last point is an interesting one. Any time you have an application that requires a long-lasting and durable connection between client and server, you build in a point of failure. The longer the connection needs to last, or the higher the communication bandwidth the connection requires, the more likely the application is to occasionally break down.
TIP:   The original design of the Web, with its stateless connections, serves as a good example of a distributed application that can tolerate almost any transient network failure.
These three reasons imply that what is really needed is a way to copy objects and send them over the wire. That is, instead of turning arguments into implicit servers, arguments need to be completely copied so that no further network calls are needed to complete the remote method invocation. Put another way, we want the result of makeWithdrawal( ) to involve creating a copy of the instance of Money on the server side. The runtime structure should resemble Figure 10-3.
Figure 10-3. Making a remote method call can create deep copies of the arguments and return values
 

The desire to avoid unnecessary network dependencies has two significant consequences:
  • Once an object is duplicated, the two objects are completely independent of each other. Any attempt to keep the copy and the original in sync would involve propagating changes over the network, entirely defeating the reason for making the copy in the first place.
  • The copying mechanism must create deep copies. If the instance of Money references another instance, then copies must be made of both instances. Otherwise, when a method is called on the second object, the call must be relayed across the wire. Moreover, all the copies must be made immediately--we can't wait until the second object is accessed to make the copy because the original might change in the meantime.





These two consequences have a very important third consequence:
  • If an object is sent twice, in separate method calls, two copies of the object will be created. In addition to arguments to method calls, this holds for objects that are referenced by the arguments. If you pass object A, which has a reference to object C, and in another call you pass object B, which also has a reference to C, you will end up with two distinct copies of C on the receiving side.

Drilling Down on Object Creation

To see why this last point holds, consider a client that executes a withdrawal and then tries to cancel the transaction by making a deposit for the same amount of money. That is, the following lines of code are executed:
server.makeWithdrawal(amount);
....
server.makeDeposit(amount);
The client has no way of knowing whether the server still has a copy of amount. After all, the server may have used it and then thrown the copy away once it was done. This means that the client has to marshall amount and send it over the wire to the server.
The RMI runtime can demarshall amount, which is the instance of Money the client sent. However, even if it has the previous object, it has no way (unless equals( ) has been overridden) to tell whether the instance it just demarshalled is equal to the previous object.
More generally, if the object being copied isn't immutable, then the server might change it. In this case, even if the two objects are currently equal, the RMI runtime has no way to tell if the two copies will always be equal and can potentially be replaced by a single copy. To see why, consider our Printer example again. At the end of Chapter 3, we considered a list of possible feature requests that could be made. One of them was the following:
Managers will want to track resource consumption. This will involve logging print requests and, quite possibly, building a set of queries that can be run against the printer's log.
This can be implemented by adding a few more fields to DocumentDescription and having the server store an indexed log of all the DocumentDescription objects it has received. For example, we may add the following fields to DocumentDescription:
public Time whenPrinted;
public Person sender;
public boolean printSucceeded;
Now consider what happens when the user actually wants to print two copies of the same document. The client application could call:
server.printDocument(document);
twice with the "same" instance of DocumentDescription. And it would be an error for the RMI runtime to create only one instance of DocumentDescription on the server side. Even though the "same" object is passed into the server twice, it is passed as parts of distinct requests and therefore as different objects.
TIP:   This is true even if the runtime can tell that the two instances of DocumentDescription are equal when it finishes demarshalling. An implementation of a printer may well have a notion of a job queue that holds instances of DocumentDescription. So our client makes the first call, and the copy of document is placed in the queue (say, at number 5), but not edited because the document hasn't been printed yet. Then our client makes the second call. At this point, the two copies of document are equal. However, we don't want to place the same object in the printer queue twice. We want to place distinct copies in the printer queue.
Thus, we come to the following conclusion: network latency, and the desire to avoid vulnerability to partial failures, force us to have a deep copy mechanism for most arguments to a remote method invocation. This copying mechanism has to make deep copies, and it cannot perform any validation to eliminate "extra" copies across methods.
TIP:   While this discussion provides examples of implementation decisions that force two copies to occur, it's important to note that, even without such examples, clients should be written as if the servers make independent copies. That is, clients are written to use interfaces. They should not, and cannot, make assumptions about server-side implementations of the interfaces.

Using Serialization

Serialization is a mechanism built into the core Java libraries for writing a graph of objects into a stream of data. This stream of data can then be programmatically manipulated, and a deep copy of the objects can be made by reversing the process. This reversal is often called deserialization.
In particular, there are three main uses of serialization:
As a persistence mechanism
If the stream being used is FileOutputStream, then the data will automatically be written to a file.


As a copy mechanism
If the stream being used is ByteArrayOutputStream, then the data will be written to a byte array in memory. This byte array can then be used to create duplicates of the original objects.


As a communication mechanism
If the stream being used comes from a socket, then the data will automatically be sent over the wire to the receiving socket, at which point another program will decide what to do.

The important thing to note is that the use of serialization is independent of the serialization algorithm itself. If we have a serializable class, we can save it to a file or make a copy of it simply by changing the way we use the output of the serialization mechanism.
As you might expect, serialization is implemented using a pair of streams. Even though the code that underlies serialization is quite complex, the way you invoke it is designed to make serialization as transparent as possible to Java developers. To serialize an object, create an instance of ObjectOutputStream and call the writeObject( ) method; to read in a serialized object, create an instance of ObjectInputStream and call the readObject( ) method.

ObjectOutputStream

ObjectOutputStream, defined in the java.io package, is a stream that implements the "writing-out" part of the serialization algorithm.[2] The methods implemented by ObjectOutputStream can be grouped into three categories: methods that write information to the stream, methods used to control the stream's behavior, and methods used to customize the serialization algorithm.

The "write" methods

The first, and most intuitive, category consists of the "write" methods:
public void write(byte[] b);
public void write(byte[] b, int off, int len);
public void write(int data);
public void writeBoolean(boolean data);
public void writeByte(int data);
public void writeBytes(String data);
public void writeChar(int data);
public void writeChars(String data);
public void writeDouble(double data);
public void writeFields( );
public void writeFloat(float data);
public void writeInt(int data);
public void writeLong(long data);
public void writeObject(Object obj);
public void writeShort(int data);
public void writeUTF(String s);
public void defaultWriteObject( );
For the most part, these methods should seem familiar. writeFloat( ), for example, works exactly as you would expect after reading Chapter 1--it takes a floating-point number and encodes the number as four bytes. There are, however, two new methods here: writeObject( ) and defaultWriteObject( ).
writeObject( ) serializes an object. In fact, writeObject( ) is the heart of the serialization mechanism. In the simplest and most common case, serializing an object involves doing two things: creating an ObjectOutputStream and calling writeObject( ) with a single "top-level" instance. The following code snippet shows the entire process, storing an object--and all the objects to which it refers--into a file:
FileOutputStream underlyingStream = new FileOutputStream("C:\\temp\\test");
ObjectOutputStream serializer = new ObjectOutputStream(underlyingStream);
serializer.writeObject(serializableObject);
Of course, this works seamlessly with the other methods for writing data. That is, if you wanted to write two floats, a String, and an object to a file, you could do so with the following code snippet:
FileOutputStream underlyingStream = new FileOutputStream("C:\\temp\\test");
ObjectOutputStream serializer = new ObjectOutputStream(underlyingStream);
serializer.writeFloat(firstFloat);
serializer.writeFloat(secondFloat);
serializer.writeUTF(aString);
serializer.writeObject(serializableObject);
TIP:   ObjectOutputStream's constructor takes an OutputStream as an argument. This is analogous to many of the streams we looked at in Chapter 1. ObjectOutputStream and ObjectInputStream are simply encoding and transformation layers. This enables RMI to send objects over the wire by opening a socket connection, associating the OutputStream with the socket connection, creating an ObjectOutputStream on top of the socket's OutputStream, and then calling writeObject( ).
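For example, a minimal sketch of sending an object directly over a socket looks almost identical to the file-based snippet above. The host and port values are placeholders, and serializableObject is assumed to be a serializable instance, as in the earlier snippets:

Socket socket = new Socket(host, port);   // host and port are placeholders
ObjectOutputStream serializer = new ObjectOutputStream(socket.getOutputStream());
serializer.writeObject(serializableObject);
serializer.flush();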
The other new "write" method is defaultWriteObject(). defaultWriteObject( ) makes it much easier to customize how instances of a single class are serialized. However, defaultWriteObject( ) has some strange restrictions placed on when it can be called. Here's what the documentation says about defaultWriteObject( ):
Write the nonstatic and nontransient fields of the current class to this stream. This may only be called from the writeObject method of the class being serialized. It will throw the NotActiveException if it is called otherwise.
That is, defaultWriteObject( ) is a method that works only when it is called from another specific method at a particular time. Since defaultWriteObject( ) is useful only when you are customizing the information stored for a particular class, this turns out to be a reasonable restriction. We'll talk more about defaultWriteObject( ) later in the chapter, when we discuss how to make a class serializable.

The stream manipulation methods

ObjectOutputStream also implements four methods that deal with the basic mechanics of manipulating the stream:
public void reset( );
public void close( );
public void flush( );
public void useProtocolVersion(int version);
With the exception of useProtocolVersion( ), these methods should be familiar. In fact, reset( ), close( ), and flush( ) are standard stream methods. useProtocolVersion( ), on the other hand, changes the version of the serialization mechanism that is used. This is necessary because the serialization format and algorithm may need to change in a way that's not backwards-compatible. If another application needs to read in your serialized data, and the applications will be versioning independently (or running in different versions of the JVM), you may want to standardize on a protocol version.
TIP:   There are two versions of the serialization protocol currently defined: PROTOCOL_VERSION_1 and PROTOCOL_VERSION_2. If you send serialized data to a 1.1 (or earlier) JVM, you should probably use PROTOCOL_VERSION_1. The most common case of this involves applets. Most applets run in browsers over which the developer has no control. This means, in particular, that the JVM running the applet could be anything, from Java 1.0.2 through the latest JVM. Most servers, on the other hand, are written using JDK1.2.2 or later.[3] If you pass serialized objects between an applet and a server, you should specify the serialization protocol.
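A minimal sketch of pinning the protocol version, reusing the file-based snippet from earlier, might look like the following. useProtocolVersion( ) must be called before any objects are written, and the constants are defined in java.io.ObjectStreamConstants:

FileOutputStream underlyingStream = new FileOutputStream("C:\\temp\\test");
ObjectOutputStream serializer = new ObjectOutputStream(underlyingStream);
serializer.useProtocolVersion(ObjectStreamConstants.PROTOCOL_VERSION_1);
serializer.writeObject(serializableObject);
serializer.flush();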

Methods that customize the serialization mechanism

The last group of methods consists mostly of protected methods that provide hooks that allow the serialization mechanism itself, rather than the data associated with a particular class, to be customized. These methods are:
public ObjectOutputStream.PutField putFields( );
protected void annotateClass(Class cl);
protected void annotateProxyClass(Class cl);
protected boolean enableReplaceObject(boolean enable);
protected Object replaceObject(Object obj);
protected void drain( );
protected void writeObjectOverride(Object obj);
protected void writeClassDescriptor(ObjectStreamClass classdesc);
protected void writeStreamHeader( );
These methods are more important to people who tailor the serialization algorithm to a particular use or develop their own implementation of serialization. As such, they require a deeper understanding of the serialization algorithm. We'll discuss these methods in more detail later, after we've gone over the actual algorithm used by the serialization mechanism.

ObjectInputStream

ObjectInputStream, defined in the java.io package, implements the "reading-in" part of the serialization algorithm. It is the companion to ObjectOutputStream--objects serialized using ObjectOutputStream can be deserialized using ObjectInputStream. Like ObjectOutputStream, the methods implemented by ObjectInputStream can be grouped into three categories: methods that read information from the stream, methods that are used to control the stream's behavior, and methods that are used to customize the serialization algorithm.

The "read" methods

The first, and most intuitive, category consists of the "read" methods:
public int read( );
public int read(byte[] b, int off, int len);
public boolean readBoolean( );
public byte readByte( );
public char readChar( );
public double readDouble( );
public float readFloat( );
public int readInt( );
public long readLong( );
public Object readObject( );
public short readShort( );
public int readUnsignedByte( );
public int readUnsignedShort( );
public String readUTF( );
public void defaultReadObject( );
Just as with ObjectOutputStream's write( ) methods, these methods should be familiar. readFloat( ), for example, works exactly as you would expect after reading Chapter 1: it reads four bytes from the stream and converts them into a single floating-point number, which is returned by the method call. And, again as with ObjectOutputStream, there are two new methods here: readObject( ) and defaultReadObject( ).
Just as writeObject( ) serializes an object, readObject( ) deserializes it. Deserializing an object involves doing two things: creating an ObjectInputStream and then calling readObject( ). The following code snippet shows the entire process, creating a copy of an object (and all the objects to which it refers) from a file:
FileInputStream underlyingStream = new FileInputStream("C:\\temp\\test");
ObjectInputStream deserializer = new ObjectInputStream(underlyingStream);
Object deserializedObject = deserializer.readObject( );
This code is exactly inverse to the code we used for serializing the object in the first place. If we wanted to make a deep copy of a serializable object, we could first serialize the object and then deserialize it, as in the following code example:
ByteArrayOutputStream memoryOutputStream = new ByteArrayOutputStream( );
ObjectOutputStream serializer = new ObjectOutputStream(memoryOutputStream);
serializer.writeObject(serializableObject);
serializer.flush( );

ByteArrayInputStream memoryInputStream = new ByteArrayInputStream(
    memoryOutputStream.toByteArray( ));
ObjectInputStream deserializer = new ObjectInputStream(memoryInputStream);
Object deepCopyOfOriginalObject = deserializer.readObject( );
This code simply places an output stream into memory, serializes the object to the memory stream, creates an input stream based on the same piece of memory, and runs the deserializer on the input stream. The end result is a deep copy of the object with which we started.

The stream manipulation methods

There are five basic stream manipulation methods defined for ObjectInputStream:
public int available( );
public void close( );
public void readFully(byte[] data);
public void readFully(byte[] data, int offset, int size);
public int skipBytes(int len);
Of these, available( ) and close( ) are methods first defined on InputStream. available( ) returns the number of bytes that can be read without blocking, and close( ) closes the stream.
The three new methods are also straightforward. skipBytes( ) skips the indicated number of bytes in the stream, blocking until all the information has been read. And the two readFully( ) methods perform a batch read into a byte array, also blocking until all the data has been read in.
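For example, here is a hedged sketch of reading back a length-prefixed byte array, reusing the deserializer stream from the earlier snippet; the int length prefix is an assumption about how the data was originally written:

int length = deserializer.readInt();
byte[] data = new byte[length];
deserializer.readFully(data);   // blocks until all 'length' bytes have been read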

Methods that customize the serialization mechanism

The last group of methods consists mostly of protected methods that provide hooks, which allow the serialization mechanism itself, rather than the data associated with a particular class, to be customized. These methods are:
protected boolean enableResolveObject(boolean enable);
protected Class resolveClass(ObjectStreamClass v);
protected Object resolveObject(Object obj);
protected Class resolveProxyClass(String[] interfaces);
protected ObjectStreamClass readClassDescriptor( );
protected Object readObjectOverride( );
protected void readStreamHeader( );
public void registerValidation(ObjectInputValidation obj, int priority);
public ObjectInputStream.GetField readFields( );
These methods are more important to people who tailor the serialization algorithm to a particular use or develop their own implementation of serialization. Like before, they also require a deeper understanding of the serialization algorithm, so I'll hold off on discussing them right now.

How to Make a Class Serializable

So far, we've focused on the mechanics of serializing an object. We've assumed we have a serializable object and discussed, from the point of view of client code, how to serialize it. The next step is discussing how to make a class serializable.
There are four basic things you must do when you are making a class serializable. They are:
  1. Implement the Serializable interface.
  2. Make sure that instance-level, locally defined state is serialized properly.
  3. Make sure that superclass state is serialized properly.
  4. Override equals( ) and hashCode( ).

Let's look at each of these steps in more detail.

Implement the Serializable Interface

This is by far the easiest of the steps. The Serializable interface is an empty interface; it declares no methods at all. So implementing it amounts to adding "implements Serializable" to your class declaration.
Reasonable people may wonder about the utility of an empty interface. Rather than define an empty interface, and require class definitions to implement it, why not simply make every object serializable? The main reason not to do this is that some classes don't have an obvious serialization. Consider, for example, an instance of File. An instance of File represents a file. Suppose, for example, it was created using the following line of code:
File file = new File("c:\\temp\\foo");
It's not at all clear what should be written out when this is serialized. The problem is that the file itself has a different lifecycle than the serialized data. The file might be edited, or deleted entirely, while the serialized information remains unchanged. Or the serialized information might be used to restart the application on another machine, where "C:\\temp\\foo" is the name of an entirely different file.
Another example is provided by the Thread[4] class. A thread represents a flow of execution within a particular JVM. Serializing one would require storing not only the stack and all the local variables, but also all the related locks and threads, and then restarting all the threads properly when the instance is deserialized.
TIP:   Things get worse when you consider platform dependencies. In general, any class that involves native code is not really a good candidate for serialization.

Make Sure That Instance-Level, Locally Defined State Is Serialized Properly

Class definitions contain variable declarations. The instance-level, locally defined variables (e.g., the nonstatic variables) are the ones that contain the state of a particular instance. For example, in our Money class, we declared one such field:
public class Money extends ValueObject {
    private int _cents;
    ....
}
The serialization mechanism has a nice default behavior--if all the instance-level, locally defined variables have values that are either serializable objects or primitive datatypes, then the serialization mechanism will work without any further effort on our part. For example, our implementations of Account, such as Account_Impl, would present no problems for the default serialization mechanism:
public class Account_Impl extends UnicastRemoteObject implements Account {
    private Money _balance;
    ...
}
While _balance doesn't have a primitive type, it does refer to an instance of Money, which is a serializable class.
If, however, some of the fields don't have primitive types, and don't refer to serializable classes, more work may be necessary. Consider, for example, the implementation of ArrayList from the java.util package. An ArrayList really has only two pieces of state:
public class ArrayList extends AbstractList implements List, Cloneable, java.io.Serializable {
    private Object elementData[];
    private int size;
    ...
}
But hidden in here is a problem: ArrayList is a generic container class whose state is stored as an array of objects. While arrays are first-class objects in Java, serializing elementData with the default mechanism would be a mistake: the backing array is usually longer than the number of elements actually stored, and there is no guarantee that every object it contains is itself serializable. This means that ArrayList can't just implement the Serializable interface and leave it at that. It has to provide extra information to help the serialization mechanism handle the elementData field. There are three basic solutions to this problem:
  • Fields can be declared to be transient.
  • The writeObject( )/readObject( ) methods can be implemented.
  • serialPersistentFields can be declared.

Declaring transient fields

The first, and easiest, thing you can do is simply mark some fields using the transient keyword. In ArrayList, for example, elementData is really declared to be a transient field:
public class ArrayList extends AbstractList implements List, Cloneable, java.io.Serializable {
    private transient Object elementData[];
    private int size;
    ...
}
This tells the default serialization mechanism to ignore the variable. In other words, the serialization mechanism simply skips over the transient variables. In the case of ArrayList, the default serialization mechanism would attempt to write out size, but ignore elementData entirely.
This can be useful in two, usually distinct, situations:
The variable isn't serializable
If the variable isn't serializable, then the serialization mechanism will throw an exception when it tries to serialize the variable. To avoid this, you can declare the variable to be transient.


The variable is redundant
Suppose that the instance caches the result of a computation. Locally, we might want to store the result of the computation, in order to save some processor time. But when we send the object over the wire, we might worry more about consuming bandwidth and thus discard the cached computation since we can always regenerate it later on.
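As a sketch of the second case (the class, its fields, and the computeArea( ) method here are hypothetical, used only for illustration), a transient field can hold a cached value that the receiver simply recomputes:

import java.awt.Point;
import java.io.Serializable;

public class Polygon implements Serializable {
    private Point[] _vertices;                // the real state (java.awt.Point is serializable)
    private transient boolean _areaComputed;  // cache bookkeeping; not worth sending
    private transient double _area;           // redundant, cached value

    public double getArea() {
        if (!_areaComputed) {
            _area = computeArea();            // recomputed on demand after deserialization
            _areaComputed = true;
        }
        return _area;
    }

    private double computeArea() {
        // details omitted in this sketch
        return 0.0;
    }
}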

Implementing writeObject( ) and readObject( )

Suppose that the first case applies. A field takes values that aren't serializable. If the field is still an important part of the state of our instance, such as elementData in the case of an ArrayList, simply declaring the variable to be transient isn't good enough. We need to save and restore the state stored in the variable. This is done by implementing a pair of methods with the following signatures:
private void writeObject(java.io.ObjectOutputStream out) throws IOException;
private void readObject(java.io.ObjectInputStream in) throws IOException, ClassNotFoundException;
When the serialization mechanism starts to write out an object, it will check to see whether the class implements writeObject( ). If so, the serialization mechanism will not use the default mechanism and will not write out any of the instance variables. Instead, it will call writeObject( ) and depend on the method to store out all the important state. Here is ArrayList's implementation of writeObject( ):
private synchronized void writeObject(java.io.ObjectOutputStream stream)
        throws java.io.IOException {
    stream.defaultWriteObject();
    stream.writeInt(elementData.length);
    for (int i = 0; i < size; i++)
        stream.writeObject(elementData[i]);
}
The first thing this does is call defaultWriteObject( ). defaultWriteObject( ) invokes the default serialization mechanism, which serializes all the nontransient, nonstatic instance variables. Next, the method writes out elementData.length and then calls the stream's writeObject( ) for each element of elementData.
There's an important point here that is sometimes missed: readObject( ) and writeObject( ) are a pair of methods that need to be implemented together. If you do any customization of serialization inside one of these methods, you need to implement the other method. If you don't, the serialization algorithm will fail.
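For completeness, here is a sketch of the matching readObject( ), closely following the JDK's implementation of the era (details may differ between JDK versions):

private synchronized void readObject(java.io.ObjectInputStream stream)
        throws java.io.IOException, ClassNotFoundException {
    // read in size and any other nontransient, nonstatic fields
    stream.defaultReadObject();

    // recreate the backing array at its original length, then read
    // the elements back in the order writeObject() stored them
    int arrayLength = stream.readInt();
    elementData = new Object[arrayLength];
    for (int i = 0; i < size; i++)
        elementData[i] = stream.readObject();
}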

Unit Tests and Serialization

Unit tests are used to test a specific piece of functionality in a class. They are explicitly not end-to-end or application-level tests. It's often a good idea to adopt a unit-testing harness such as JUnit when developing an application. JUnit gives you an automated way to run unit tests on individual classes and is available from http://www.junit.org.
If you adopt a unit-testing methodology, then any serializable class should pass the following three tests:
  • If it implements readObject( ), it should implement writeObject( ), and vice-versa.
  • It is equal (using the equals( ) method) to a serialized copy of itself.
  • It has the same hashcode as a serialized copy of itself.

Similar constraints hold for classes that implement the Externalizable interface.
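Here is a minimal sketch of such a test in JUnit 3 style; the MoneyTest class and the Money constructor used here are assumptions for illustration:

import java.io.*;
import junit.framework.TestCase;

public class MoneyTest extends TestCase {
    public void testSerializationRoundTrip() throws Exception {
        Money original = new Money(1500);   // assumed constructor taking cents

        // serialize to an in-memory buffer
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ObjectOutputStream serializer = new ObjectOutputStream(buffer);
        serializer.writeObject(original);
        serializer.flush();

        // deserialize the copy
        ObjectInputStream deserializer = new ObjectInputStream(
            new ByteArrayInputStream(buffer.toByteArray()));
        Money copy = (Money) deserializer.readObject();

        // the copy should be equal and have the same hashcode
        assertEquals(original, copy);
        assertEquals(original.hashCode(), copy.hashCode());
    }
}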

Declaring serialPersistentFields

The final option that can be used is to explicitly declare which fields should be stored by the serialization mechanism. This is done using a special static final variable called serialPersistentFields, as shown in the following code snippet:
private static final ObjectStreamField[] serialPersistentFields = {
    new ObjectStreamField("size", Integer.TYPE),
    ....
};
This line of code declares that the field named size, which is of type int, is a serial persistent field and will be written to the output stream by the serialization mechanism. Declaring serialPersistentFields is almost the opposite of declaring some fields transient. The meaning of transient is, "This field shouldn't be stored by serialization," and the meaning of serialPersistentFields is, "These fields should be stored by serialization."
But there is one important difference between declaring some variables to be transient and others to be serialPersistentFields. In order to declare variables to be transient, they must be locally declared. In other words, you must have access to the code that declares the variable. There is no such requirement for serialPersistentFields. You simply provide the name of the field and the type.
TIP:   What if you try to do both? That is, suppose you declare some variables to be transient, and then also provide a definition for serialPersistentFields? The answer is that the transient keyword is ignored; the definition of serialPersistentFields is definitive.
So far, we've talked only about instance-level state. What about class-level state? Suppose you have important information stored in a static variable? Static variables won't get saved by serialization unless you add special code to do so. In our context (shipping objects over the wire between clients and servers), statics are usually a bad idea anyway.

Make Sure That Superclass State Is Handled Correctly

After you've handled the locally declared state, you may still need to worry about variables declared in a superclass. If the superclass implements the Serializable interface, then you don't need to do anything. The serialization mechanism will handle everything for you, either by using default serialization or by invoking writeObject( )/readObject( ) if they are declared in the superclass.
If the superclass doesn't implement Serializable, you will need to store its state. There are two different ways to approach this. You can use serialPersistentFields to tell the serialization mechanism about some of the superclass instance variables, or you can use writeObject( )/readObject( ) to handle the superclass state explicitly. Both of these, unfortunately, require you to know a fair amount about the superclass. If you're getting the .class files from another source, you should be aware that versioning issues can cause some really nasty problems. If you subclass a class, and that class's internal representation of instance-level state changes, you may not be able to load in your serialized data. While you can sometimes work around this by using a sufficiently convoluted readObject( ) method, this may not be a solvable problem. We'll return to this later. However, be aware that the ultimate solution may be to just implement the Externalizable interface instead, which we'll talk about later.
Another aspect of handling the state of a nonserializable superclass is that nonserializable superclasses must have a zero-argument constructor. This isn't important for serializing out an object, but it's incredibly important when deserializing an object. Deserialization works by creating an instance of a class and filling out its fields correctly. During this process, the deserialization algorithm doesn't actually call any of the serialized class's constructors, but does call the zero-argument constructor of the first nonserializable superclass. If there isn't a zero-argument constructor, then the deserialization algorithm can't create instances of the class, and the whole process fails.
WARNING:   If you can't create a zero-argument constructor in the first nonserializable superclass, you'll have to implement the Externalizable interface instead.
Simply adding a zero-argument constructor might seem a little problematic. Suppose the object already has several constructors, all of which take arguments. If you simply add a zero-argument constructor, then the serialization mechanism might leave the object in a half-initialized, and therefore unusable, state.
However, since serialization will supply the instance variables with correct values from an active instance immediately after instantiating the object, the only way this problem could arise is if the constructors actually do something with their arguments--besides setting variable values.
If all the constructors take arguments and actually execute initialization code as part of the constructor, then you may need to refactor a bit. The usual solution is to move the local initialization code into a new method (usually named something like initialize( )), which is then called from the original constructor:
public MyObject(arglist) {
    // set local variables from arglist
    // perform local initialization
}
to something that looks like:
private MyObject( ) {
    // zero-argument constructor, invoked by serialization and never by any
    // other piece of code. Note that it doesn't call initialize( )
}

public MyObject(arglist) {
    // set local variables from arglist
    initialize( );
}

private void initialize( ) {
    // perform local initialization
}
After this is done, writeObject( )/readObject( ) should be implemented, and readObject( ) should end with a call to initialize( ). Sometimes this will result in code that simply invokes the default serialization mechanism, as in the following snippet:
private void writeObject(java.io.ObjectOutputStream stream)
        throws java.io.IOException {
    stream.defaultWriteObject( );
}

private void readObject(java.io.ObjectInputStream stream)
        throws java.io.IOException, ClassNotFoundException {
    stream.defaultReadObject( );
    initialize( );
}
TIP:   If creating a zero-argument constructor is difficult (for example, you don't have the source code for the superclass), your class will need to implement the Externalizable interface instead of Serializable.

Override equals( ) and hashCode( ) if Necessary

The default implementations of equals( ) and hashCode( ), which are inherited from java.lang.Object, simply use an instance's location in memory. This can be problematic. Consider our previous deep copy code example:
ByteArrayOutputStream memoryOutputStream = new ByteArrayOutputStream( );
ObjectOutputStream serializer = new ObjectOutputStream(memoryOutputStream);
serializer.writeObject(serializableObject);
serializer.flush( );

ByteArrayInputStream memoryInputStream = new ByteArrayInputStream(
    memoryOutputStream.toByteArray( ));
ObjectInputStream deserializer = new ObjectInputStream(memoryInputStream);
Object deepCopyOfOriginalObject = deserializer.readObject( );
The potential problem here involves the following boolean test:
serializableObject.equals(deepCopyOfOriginalObject)
Sometimes, as in the case of Money and DocumentDescription, the answer should be true. If two instances of Money have the same values for _cents, then they are equal. However, the implementation of equals( ) inherited from Object will return false.
The same problem occurs with hashCode( ). Note that Object implements hashCode( ) by returning the memory address of the instance. Hence, no two instances ever have the same hashCode( ) using Object's implementation. If two objects are equal, however, then they should have the same hashcode. So if you need to override equals( ), you probably need to override hashCode( ) as well.
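For example, a sketch of what Money might do, assuming _cents is its only significant state (the actual class may differ slightly), looks like this:

public boolean equals(Object object) {
    if (!(object instanceof Money)) {
        return false;
    }
    return _cents == ((Money) object)._cents;
}

public int hashCode() {
    return _cents;   // equal instances must return equal hashcodes
}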
TIP:  With the exception of declaring variables to be transient, all our changes involve adding functionality. Making a class serializable rarely involves significant changes to its functionality and shouldn't result in any changes to method implementations. This means that it's fairly easy to retrofit serialization onto an existing object hierarchy. The hardest part is usually implementing equals( ) and hashCode( ).

Making DocumentDescription Serializable

To make this more concrete, we now turn to the DocumentDescription class from the RMI version of our printer server. The code for the first nonserializable version of DocumentDescription was the following:
public class DocumentDescription implements PrinterConstants {
    private InputStream _actualDocument;
    private int _length;
    private int _documentType;
    private boolean _printTwoSided;
    private int _printQuality;

    public DocumentDescription(InputStream actualDocument) throws IOException {
        this(actualDocument, DEFAULT_DOCUMENT_TYPE, DEFAULT_PRINT_TWO_SIDED,
            DEFAULT_PRINT_QUALITY);
    }

    public DocumentDescription(InputStream actualDocument, int documentType,
            boolean printTwoSided, int printQuality) throws IOException {
        _documentType = documentType;
        _printTwoSided = printTwoSided;
        _printQuality = printQuality;
        BufferedInputStream buffer = new BufferedInputStream(actualDocument);
        DataInputStream dataInputStream = new DataInputStream(buffer);
        ByteArrayOutputStream temporaryBuffer = new ByteArrayOutputStream();
        _length = copy(dataInputStream, new DataOutputStream(temporaryBuffer));
        _actualDocument = new DataInputStream(
            new ByteArrayInputStream(temporaryBuffer.toByteArray()));
    }

    public int getDocumentType() {
        return _documentType;
    }

    public boolean isPrintTwoSided() {
        return _printTwoSided;
    }

    public int getPrintQuality() {
        return _printQuality;
    }

    private int copy(InputStream source, OutputStream destination) throws IOException {
        int nextByte;
        int numberOfBytesCopied = 0;
        while (-1 != (nextByte = source.read())) {
            destination.write(nextByte);
            numberOfBytesCopied++;
        }
        destination.flush();
        return numberOfBytesCopied;
    }
}
We will make this into a serializable class by following the steps outlined in the previous section.

Implement the Serializable interface

This is easy. All we need to do is change the class declaration:
public class DocumentDescription implements Serializable, PrinterConstants

Make sure that instance-level, locally defined state is serialized properly

We have five fields to take care of:
private InputStream _actualDocument;
private int _length;
private int _documentType;
private boolean _printTwoSided;
private int _printQuality;
Of these, four are primitive types that serialization can handle without any problem. However, _actualDocument is a problem. InputStream is not a serializable class. And the contents of _actualDocument are very important; _actualDocument contains the document we want to print. There is no point in serializing an instance of DocumentDescription unless we somehow serialize _actualDocument as well.
If we have fields that serialization cannot handle, and they must be serialized, then our only option is to implement readObject( ) and writeObject( ). For DocumentDescription, we declare _actualDocument to be transient and then implement readObject( ) and writeObject( ) as follows:
private transient InputStream _actualDocument;

private void writeObject(java.io.ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    copy(_actualDocument, out);
}

private void readObject(java.io.ObjectInputStream in) throws IOException,
        ClassNotFoundException {
    in.defaultReadObject();
    ByteArrayOutputStream temporaryBuffer = new ByteArrayOutputStream();
    copy(in, temporaryBuffer, _length);
    _actualDocument = new DataInputStream(
        new ByteArrayInputStream(temporaryBuffer.toByteArray()));
}

private void copy(InputStream source, OutputStream destination, int length)
        throws IOException {
    int counter;
    int nextByte;
    for (counter = 0; counter < length; counter++) {
        nextByte = source.read();
        destination.write(nextByte);
    }
    destination.flush();
}
Note that we declare _actualDocument to be transient and call defaultWriteObject( ) in the first line of our writeObject( ) method. Doing these two things allows the standard serialization mechanism to serialize the other four instance variables without any extra effort on our part. We then simply copy _actualDocument to the stream.
Our implementation of readObject( ) simply calls defaultReadObject( ) and then reads _actualDocument from the stream. In order to read _actualDocument from the stream, we used the length of the document, which had previously been written to the stream. In essence, we needed to encode some metadata into the stream, in order to correctly pull our data out of the stream.
This code is a little ugly. We're using serialization, but we're still forced to think about how to encode some of our state when we're sending it out of the stream. In fact, the code for writeObject( ) and readObject( ) is remarkably similar to the marshalling code we implemented directly for the socket-based version of the printer server. This is, unfortunately, often the case. Serialization's default implementation handles simple objects very well. But, every now and then, you will want to send a nonserializable object over the wire, or improve the serialization algorithm for efficiency. Doing so amounts to writing the same code you write if you implement all the socket handling yourself, as in our socket-based version of the printer server.
TIP:   There is also an order dependency here. The first value written must be the first value read. Since we start writing by calling defaultWriteObject( ), we have to start reading by calling defaultReadObject( ). On the bright side, this means we'll have an accurate value for _length before we try to read _actualDocument from the stream.

Make sure that superclass state is handled correctly

This isn't a problem. The superclass, java.lang.Object, doesn't actually have any important state that we need to worry about. Since it also already has a zero-argument constructor, we don't need to do anything.

Override equals( ) and hashCode( ) if necessary

In our current implementation of the printer server, we don't need to do this. The server never checks for equality between instances of DocumentDescription. Nor does it store them in a container object that relies on their hashcodes.

Did We Cheat When Implementing Serializable for DocumentDescription?

It may seem like we cheated a bit in implementing DocumentDescription. Three of the five steps in making a class serializable didn't actually result in changes to the code. Indeed, the only work we really did was implementing readObject( ) and writeObject( ). But it's not really cheating. Serialization is just designed to be easy to use. It has a good set of defaults, and, at least in the case of value objects intended to be passed over the wire, the default behavior is often good enough.

The Serialization Algorithm

By now, you should have a pretty good feel for how the serialization mechanism works for individual classes. The next step in explaining serialization is to discuss the actual serialization algorithm in a little more detail. This discussion won't handle all the details of serialization.[5] Instead, the idea is to cover the algorithm and protocol, so you can understand how the various hooks for customizing serialization work and how they fit into the context of an RMI application.

The Data Format

The first step is to discuss what gets written to the stream when an instance is serialized. Be warned: it's a lot more information than you might guess from the previous discussion.
An important part of serialization involves writing out class-related metadata associated with an instance. Most instances are instances of more than one class. For example, an instance of String is also an instance of Object. Any given instance, however, is an instance of only a few classes. These classes can be written as a sequence: C1, C2...CN, in which C1 is a superclass of C2, C2 is a superclass of C3, and so on. This is actually a linear sequence because Java is a single-inheritance language for classes. We call C1 the least superclass and CN the most-derived class. See Figure 10-4.
Figure 10-4. Inheritance diagram
 

After writing out the associated class information, the serialization mechanism stores out the following information for each instance:
  • A description of the most-derived class.
  • Data associated with the instance, interpreted as an instance of the least superclass.
  • Data associated with the instance, interpreted as an instance of the second least superclass.

And so on until:
  • Data associated with the instance, interpreted as an instance of the most-derived class.
So what really happens is that the type of the instance is stored out, and then all the serializable state is stored in discrete chunks that correspond to the class structure. But there's a question still remaining: what do we mean by "a description of the most-derived class?" This is either a reference to a class description that has already been recorded (e.g., an earlier location in the stream) or the following information:
  • The version ID of the class (the serialVersionUID), which is used to validate the .class files
  • A boolean stating whether writeObject( )/readObject( ) are implemented
  • The number of serializable fields
  • A description of each field (its name and type)
  • Extra data produced by ObjectOutputStream's annotateClass( ) method
  • A description of its superclass if the superclass is serializable

This should, of course, immediately seem familiar. The class descriptions consist entirely of metadata that allows the instance to be read back in. In fact, this is one of the most beautiful aspects of serialization; the serialization mechanism automatically, at runtime, converts class objects into metadata so instances can be serialized with the least amount of programmer work.
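You can inspect this metadata yourself through java.io.ObjectStreamClass, which is the runtime form of the class descriptions written to the stream. Here is a small sketch, using the Money class from the banking example:

ObjectStreamClass description = ObjectStreamClass.lookup(Money.class);
System.out.println("class: " + description.getName());
System.out.println("version ID: " + description.getSerialVersionUID());

ObjectStreamField[] fields = description.getFields();
for (int i = 0; i < fields.length; i++) {
    System.out.println("field: " + fields[i].getName()
        + " of type " + fields[i].getType());
}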

A Simplified Version of the Serialization Algorithm

In this section, I describe a slightly simplified version of the serialization algorithm. I then proceed to a more complete description of the serialization process in the next section.

Writing

Because the class descriptions actually contain the metadata, the basic idea behind the serialization algorithm is pretty easy to describe. The only tricky part is handling circular references.
The problem is this: suppose instance A refers to instance B. And instance B refers back to instance A. Completely writing out A requires you to write out B. But writing out B requires you to write out A. Because you don't want to get into an infinite loop, or even write out an instance or a class description more than once,[6] you need to keep track of what's already been written to the stream.
ObjectOutputStream does this by maintaining a mapping from instances and classes to handles. When writeObject( ) is called with an argument that has already been written to the stream, the handle is written to the stream, and no further operations are necessary.
If, however, writeObject( ) is passed an instance that has not yet been written to the stream, two things happen. First, the instance is assigned a reference handle, and the mapping from instance to reference handle is stored by ObjectOutputStream. The handle that is assigned is the next integer in a sequence.
TIP:   Remember the reset( ) method on ObjectOutputStream? It clears the mapping and resets the handle counter to 0x7E0000. RMI also automatically resets its serialization mechanism after every remote method call.
Second, the instance data is written out as per the data format described earlier. This can involve some complications if the instance has a field whose value is also a serializable instance. In this case, the serialization of the first instance is suspended, and the second instance is serialized in its place (or, if the second instance has already been serialized, the reference handle for the second instance is written out). After the second instance is fully serialized, serialization of the first instance resumes. The contents of the stream look a little bit like Figure 10-5.
Figure 10-5. Contents of Serialization's data stream
 

Reading

From the description of writing, it's pretty easy to guess most of what happens when readObject( ) is called. Unfortunately, because of versioning issues, the implementation of readObject( ) is actually a little bit more complex than you might guess.
When it reads in an instance description, ObjectInputStream gets the following information:
  • Descriptions of all the classes involved
  • The serialization data from the instance





The problem is that the class descriptions that the instance of ObjectInputStream reads from the stream may not be equivalent to the class descriptions of the same classes in the local JVM. For example, if an instance is serialized to a file and then read back in three years later, there's a pretty good chance that the class definitions used to serialize the instance have changed.
This means that ObjectInputStream uses the class descriptions in two ways:
  • It uses them to actually pull data from the stream, since the class descriptions completely describe the contents of the stream.
  • It compares the class descriptions to the classes it has locally and tries to determine if the classes have changed, in which case it throws an exception. If the class descriptions match the local classes, it creates the instance and sets the instance's state appropriately.





RMI Customizes the Serialization Algorithm

RMI doesn't actually use ObjectOutputStream and ObjectInputStream. Instead, it uses custom subclasses so it can modify the serialization process by overriding some protected methods. In this section, we'll discuss the most important modifications that RMI makes when serializing instances. RMI makes similar changes when deserializing instances, but they follow from, and can easily be deduced from, the description of the serialization changes.
Recall that ObjectOutputStream contained the following protected methods:
protected void annotateClass(Class cl)
protected void annotateProxyClass(Class cl)
protected boolean enableReplaceObject(boolean enable)
protected Object replaceObject(Object obj)
protected void drain( )
protected void writeObjectOverride(Object obj)
protected void writeClassDescriptor(ObjectStreamClass classdesc)
protected void writeStreamHeader( )
These all have default implementations in ObjectOutputStream. That is, annotateClass( ) and annotateProxyClass( ) do nothing. enableReplaceObject( ) returns false, and so on. However, these methods are still called during serialization. And RMI, by overriding these methods, customizes the serialization process.
The three most important methods from the point of view of RMI are:
protected void annotateClass(Class cl)
protected boolean enableReplaceObject(boolean enable)
protected Object replaceObject(Object obj)
Let's describe how RMI overrides each of these.

annotateClass( )

ObjectOutputStream calls annotateClass( ) when it writes out class descriptions. Annotations are used to provide extra information about a class that comes from the serialization mechanism and not from the class itself. The basic serialization mechanism has no real need for annotations; most of the information about a given class is already stored in the stream.
TIP:   RMI's dynamic classloading system uses annotateClass( ) to record where .class files are stored. We'll discuss this more in Chapter 19.
RMI, on the other hand, uses annotations to record codebase information. That is, in addition to recording the class descriptions, RMI also records information about the location from which it loaded each class's bytecode. A codebase is often simply a location in a filesystem, which is frequently useless information, since the JVM that deserializes the instances may have a very different filesystem than the JVM that serialized them. However, a codebase isn't restricted to being a location in a filesystem. The only restriction on codebases is that they have to be valid URLs. That is, a codebase is a URL that specifies a location on the network from which the bytecode for a class can be obtained. This enables RMI to dynamically load new classes based on the serialized information in the stream. We'll return to this in Chapter 19.
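As a rough sketch (this is not RMI's actual implementation, and the codebase URL is a placeholder), a stream that records a codebase for every class it describes might look like this:

import java.io.*;

public class CodebaseAnnotatingOutputStream extends ObjectOutputStream {
    private String _codebase;   // e.g., "http://server.example.com/classes/" -- a placeholder

    public CodebaseAnnotatingOutputStream(OutputStream out, String codebase)
            throws IOException {
        super(out);
        _codebase = codebase;
    }

    protected void annotateClass(Class cl) throws IOException {
        // written right after the class description; a matching ObjectInputStream
        // subclass would read it back in its resolveClass() implementation
        writeObject(_codebase);
    }
}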

replaceObject( )

The idea of replacement is simple; sometimes the instance that is passed to the serialization mechanism isn't the instance that ought to be written out to the data stream. To make this more concrete, recall what happened when we called rebind( ) to register a server with the RMI registry. The following code was used in the bank example:
Account_Impl newAccount = new Account_Impl(serverDescription.balance);
Naming.rebind(serverDescription.name, newAccount);
System.out.println("Account " + serverDescription.name + " successfully launched.");
This creates an instance of Account_Impl and then calls rebind( ) with that instance. Account_Impl is a server that implements the Remote interface, but not the Serializable interface. And yet, somehow, the registry, which is running in a different JVM, is sent something.
What the registry actually gets is a stub. The stub for Account_Impl, which was automatically generated by rmic, begins with:
public final class Account_Impl_Stub extends java.rmi.server.RemoteStub
java.rmi.server.RemoteStub is a class that implements the Serializable interface. The RMI serialization mechanism knows that whenever a remote server is "sent" over the wire, the server object should be replaced by a stub that knows how to communicate with the server (e.g., a stub that knows on which machine and port the server is listening).
Calling Naming.rebind( ) actually winds up passing a stub to the RMI registry. When clients make calls to Naming.lookup( ), as in the following code snippet, they also receive copies of the stub. Since the stub is serializable, there's no problem in making a copy of it:
_account = (Account)Naming.lookup(_accountNameField.getText(  ));
In order to enable this behavior, ObjectOutputStream calls enableReplaceObject( ) and replaceObject( ) during the serialization process. In other words, when an instance is about to be serialized, ObjectOutputStream does the following:
  1. It checks whether instance replacement has been enabled for this stream. (A subclass enables replacement by calling enableReplaceObject(true), typically in its constructor.)
  2. If instance replacement is enabled, it calls replaceObject( ), passing in the instance it was about to serialize, to find out which instance it should really write to the stream.
  3. It then writes the appropriate instance to the stream.
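Here is a minimal sketch of what a replacing stream can look like. The AccountServer and AccountStub classes are made up purely to keep the example self-contained; RMI's real streams perform an analogous substitution, swapping an exported remote object for its stub.

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

// Hypothetical helper classes, defined only so the sketch compiles on its own:
// a server object that can't be serialized, and a small serializable stub
// that knows how to reach it.
class AccountServer {
    String getHostAndPort() { return "bank.example.com:2138"; }
}

class AccountStub implements Serializable {
    private final String hostAndPort;
    AccountStub(String hostAndPort) { this.hostAndPort = hostAndPort; }
}

public class StubReplacingOutputStream extends ObjectOutputStream {

    public StubReplacingOutputStream(OutputStream out) throws IOException {
        super(out);
        enableReplaceObject(true); // switch replacement on for this stream
    }

    protected Object replaceObject(Object obj) throws IOException {
        if (obj instanceof AccountServer) {
            // write a stub that knows where the server lives, not the server itself
            return new AccountStub(((AccountServer) obj).getHostAndPort());
        }
        return obj; // everything else goes into the stream unchanged
    }
}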

Maintaining Direct Connections

A question that frequently arises as distributed applications get more complicated involves message forwarding. For example, suppose that we have three communicating programs: A, B, and C. At the start, A has a stub for B, B has a stub for C, and C has a stub for A. See Figure 10-6.
Figure 10-6. Communication between three applications
 

Now, what happens if A calls a method, for example, getOtherServer( ), on B that "returns" C? The answer is that A gets a deep copy of the stub B uses to communicate with C. That is, A now has a direct connection to C; whenever A tries to send a message to C, B is not involved at all. This is illustrated in Figure 10-7.
Figure 10-7. Improved communication between three applications
 

This is very good from a bandwidth and network latency point of view. But it can also be somewhat problematic. Suppose, for example, that B implements load balancing. Since B isn't involved in the A-to-C communication, it has no direct way of knowing whether A is still using C, or how heavily. We'll revisit this in later chapters, when we discuss the distributed garbage collector and the Unreferenced interface.

Versioning Classes

A few pages back, I described the serialization mechanism:
The serialization mechanism automatically, at runtime, converts class objects into metadata so instances can be serialized with the least amount of programmer work.
This is great as long as the classes don't change. When classes change, the metadata, which was created from the now-obsolete class objects, still accurately describes the serialized data, but it may no longer correspond to the current class implementations.

The Two Types of Versioning Problems

There are two basic types of versioning problems that can occur. The first occurs when a change is made to the class hierarchy (e.g., a superclass is added or removed). Suppose, for example, a personnel application made use of two serializable classes: Employee and Manager (a subclass of Employee). For the next version of the application, two more classes need to be added: Contractor and Consultant. After careful thought, the new hierarchy is based on the abstract superclass Person, which has two direct subclasses: Employee and Contractor. Consultant is defined as a subclass of Contractor, and Manager is a subclass of Employee. See Figure 10-8.
Figure 10-8. Changing the class hierarchy
 

While introducing Person is probably good object-oriented design, it breaks serialization. Recall that serialization relies on the class hierarchy to define the data format.
The second type of version problem arises from local changes to a serializable class. Suppose, for example, that in our bank example, we want to add the possibility of handling different currencies. To do so, we define a new class, Currency, and change the definition of Money:
public class Money extends ValueObject {
    public float amount;
    public Currency typeOfMoney;
}
This completely changes the definition of Money but doesn't change the object hierarchy at all.
The important distinction between the two types of versioning problems is that the first type can't really be repaired. If you have old data lying around that was serialized using an older class hierarchy, and you need to use that data, your best option is probably something along the lines of the following (a rough sketch of the first step appears after the list):
  1. Using the old class definitions, write an application that deserializes the data into instances and writes the instance data out in a neutral format, say as tab-delimited columns of text.
  2. Using the new class definitions, write a program that reads in the neutral-format data, creates instances of the new classes, and serializes these new instances.
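As a rough sketch of the first step, the following program assumes the old data was written as a single List of Employee instances, with public firstName, lastName, and socialSecurityNumber fields like the Employee class used later in this chapter; the file names are made up.

import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.ObjectInputStream;
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.List;

public class OldDataExporter {
    public static void main(String[] args) throws Exception {
        // deserialize using the *old* class definitions, which must be on the classpath...
        ObjectInputStream in = new ObjectInputStream(new FileInputStream("employees.ser"));
        List employees = (List) in.readObject();
        in.close();

        // ...and dump the state in a neutral, class-independent format
        PrintWriter out = new PrintWriter(new FileWriter("employees.txt"));
        for (Iterator i = employees.iterator(); i.hasNext(); ) {
            Employee e = (Employee) i.next();
            out.println(e.firstName + "\t" + e.lastName + "\t" + e.socialSecurityNumber);
        }
        out.close();
    }
}

The second step is the mirror image: a program compiled against the new class definitions that reads the tab-delimited file, builds instances of the new classes, and serializes them.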

The second type of versioning problem, on the other hand, can be handled locally, within the class definition.

How Serialization Detects When a Class Has Changed

In order for serialization to gracefully detect when a versioning problem has occurred, it needs to be able to detect when a class has changed. As with all the other aspects of serialization, there is a default way that serialization does this. And there is a way for you to override the default.
The default involves a hashcode. Serialization creates a single hashcode, of type long, from the following information:
  • The class name and modifiers
  • The names of any interfaces the class implements
  • Descriptions of all methods and constructors except private methods and constructors
  • Descriptions of all fields, except those that are private static or private transient

This single long, called the class's stream unique identifier (often abbreviated suid), is used to detect when a class changes. It is an extraordinarily sensitive index. For example, suppose we add the following method to Money:
public boolean isBigBucks() {
    return _cents > 5000;
}
We haven't changed, added, or removed any fields; we've simply added a method with no side effects at all. But adding this method changes the suid. Prior to adding it, the suid was 6625436957363978372L; afterwards, it was -3144267589449789474L. Moreover, if we had made isBigBucks( ) a protected method, the suid would have been 4747443272709729176L.
TIP:   These numbers can be computed using the serialVer program that ships with the JDK. For example, these were all computed by typing serialVer com.ora.rmibook.chapter10.Money at the command line for slightly different versions of the Money class.
The default behavior for the serialization mechanism is a classic "better safe than sorry" strategy. The serialization mechanism uses the suid, which defaults to an extremely sensitive index, to tell when a class has changed. If the suid stored in the stream doesn't match the suid of the local class, the serialization mechanism refuses to create instances of the new class using data that was serialized with the old one.
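You don't have to shell out to serialver to see the suid, either. The java.io.ObjectStreamClass descriptor exposes it at runtime; the following small program is a sketch of how you might check it, assuming the Money class is on the classpath.

import java.io.ObjectStreamClass;

public class PrintSuid {
    public static void main(String[] args) {
        // lookup() returns the same descriptor the serialization mechanism uses
        ObjectStreamClass descriptor =
            ObjectStreamClass.lookup(com.ora.rmibook.chapter10.Money.class);
        System.out.println("suid for Money: " + descriptor.getSerialVersionUID());
    }
}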

Implementing Your Own Versioning Scheme

While this is reasonable as a default strategy, it would be painful if serialization didn't provide a way to override the default behavior. Fortunately, it does. Serialization uses the default suid only if a class definition doesn't provide one. That is, if a class definition includes a static final long named serialVersionUID, then serialization will use that value as the suid. In the case of our Money example, if we included the line:
private static final long serialVersionUID = 1;
in our source code, then the suid would be 1, no matter how many changes we made to the rest of the class. Explicitly declaring serialVersionUID allows us to change the class, and add convenience methods such as isBigBucks( ), without losing backwards compatibility.
TIP:   serialVersionUID doesn't have to be private. However, it must be static, final, and long.
The downside to using serialVersionUID is that, if a significant change is made (for example, if a field is added to the class definition), the suid will not reflect this difference. This means that the deserialization code might not detect an incompatible version of a class. Again, using Money as an example, suppose we had:
public class Money extends ValueObject {
    private static final long serialVersionUID = 1;
    protected int _cents;
and we migrated to:
public class Money extends ValueObject {
    private static final long serialVersionUID = 1;
    public float amount;
    public Currency typeOfMoney;
}
The serialization mechanism won't detect that these are completely incompatible classes. Instead, when it tries to create the new instance, it will throw away all the data it reads in. Recall that, as part of the metadata, the serialization algorithm records the name and type of each field. Since it can't find the fields during deserialization, it simply discards the information.
The solution to this problem is to implement your own versioning inside of readObject( ) and writeObject( ). Your writeObject( ) method should begin by writing an explicit version number to the stream:
private void writeObject(java.io.ObjectOutputStream out) throws IOException {
    out.writeInt(VERSION_NUMBER);
    ....
}
In addition, your readObject( ) code should start with a switch statement based on the version number:
private void readObject(java.io.ObjectInputStream in) throws IOException,
        ClassNotFoundException {
    int version = in.readInt();
    switch(version) {
        // version-specific demarshalling code
        ....
    }
}
Doing this gives you explicit control over the versioning of your class. But there is an important consequence to consider before going down this road: as soon as you start to explicitly version your classes, defaultWriteObject( ) and defaultReadObject( ) lose a lot of their usefulness.
Trying to control versioning puts you in the position of explicitly writing all the marshalling and demarshalling code. This is a trade-off you might not want to make.
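To make the trade-off concrete, here is one way a fully versioned Money might look. This is only a sketch: it assumes both versions of the class wrote an explicit version number as their first piece of data, it drops the ValueObject superclass to stay self-contained, and it uses a String currency code (instead of the Currency class) so the example compiles on its own.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class Money implements Serializable {
    private static final long serialVersionUID = 1;

    // the stream format we are writing today
    private static final int CURRENT_VERSION = 2;

    public float amount;
    public String currencyCode;

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeInt(CURRENT_VERSION);   // the version tag always goes first
        out.writeFloat(amount);
        out.writeUTF(currencyCode);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        int version = in.readInt();
        switch (version) {
            case 1:
                // version 1 stored whole cents as an int and implied US dollars
                amount = in.readInt() / 100.0f;
                currencyCode = "USD";
                break;
            case 2:
                amount = in.readFloat();
                currencyCode = in.readUTF();
                break;
            default:
                throw new IOException("Unknown Money stream version: " + version);
        }
    }
}

Notice that none of the fields are written by defaultWriteObject( ); once you version explicitly, you own the entire marshalling format.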

Performance Issues

Serialization is a generic marshalling and demarshalling algorithm, with many hooks for customization. As an experienced programmer, you should be skeptical--generic algorithms with many hooks for customization tend to be slow. Serialization is not an exception to this rule. It is, at times, both slow and bandwidth-intensive. There are three main performance problems with serialization: it depends on reflection, it has an incredibly verbose data format, and it is very easy to send more data than is required.

Serialization Depends on Reflection

The dependence on reflection is the hardest of these to eliminate. Both serializing and deserializing require the serialization mechanism to discover information about the instance it is serializing. At a minimum, the serialization algorithm needs to find out things such as the value of serialVersionUID, whether writeObject( ) is implemented, and what the superclass structure is. What's more, using the default serialization mechanism (or calling defaultWriteObject( ) from within writeObject( )) means using reflection to discover all the field values. This can be quite slow.
TIP:   Setting serialVersionUID is a simple, and often surprisingly noticeable, performance improvement. If you don't set serialVersionUID, the serialization mechanism has to compute it. This involves going through all the fields and methods and computing a hash. If you set serialVersionUID, on the other hand, the serialization mechanism simply looks up a single value.

Serialization Has a Verbose Data Format

Serialization's data format has two problems. The first is all the class description information included in the stream. To send a single instance of Money, we need to send all of the following:
  • The description of the ValueObject class
  • The description of the Money class
  • The instance data associated with the specific instance of Money.

This isn't a lot of information, but it's information that RMI computes and sends with every method invocation.[7] Even if the first two bullets comprise only 100 extra bytes of information, the cumulative impact is probably significant.
The second problem is that each serialized instance is treated as an individual unit. If we are sending large numbers of instances within a single method invocation, then there is a fairly good chance that we could compress the data by noticing commonalities across the instances being sent.
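The serialization format itself offers no hook for this kind of cross-instance compression, but when you control both ends of a stream you can claw some of the redundancy back by layering a compressing stream underneath the object stream. The following sketch reuses the Money class and the 10,000-instance list from the benchmark later in this chapter; the file name is made up, and the approach is generic GZIP compression rather than anything serialization-specific.

import java.io.FileOutputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.zip.GZIPOutputStream;

public class CompressedMoneyWriter {
    public static void main(String[] args) throws Exception {
        ArrayList listOfMoney = new ArrayList();
        for (int i = 0; i < 10000; i++) {
            listOfMoney.add(new Money(i * 100));
        }
        // The GZIP layer squeezes out the repeated class descriptions and
        // field names before the bytes ever reach the disk (or the socket).
        ObjectOutputStream out = new ObjectOutputStream(
            new GZIPOutputStream(new FileOutputStream("C:\\temp\\foo3")));
        out.writeObject(listOfMoney);
        out.close();
    }
}

Reading the data back, of course, requires wrapping the ObjectInputStream around a GZIPInputStream in the same way.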

It Is Easy to Send More Data Than Is Required

Serialization is a recursive algorithm. You pass in a single object, and all the objects that can be reached from that object by following instance variables are also serialized. To see why this can cause problems, suppose we have a simple application that uses the Employee class:
public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
}
In a later version of the application, someone adds a new piece of functionality. As part of doing so, they add a single additional field to Employee:
public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
    public Employee manager;
}
What happens as a result of this? On the bright side, the application still works. After everything is recompiled, the entire application, including the remote method invocations, will still work. That's the nice aspect of serialization--we added new fields, and the data format used to send arguments over the wire automatically adapted to handle our changes. We didn't have to do any work at all.
On the other hand, adding a new field redefined the data format associated with Employee. Because serialVersionUID wasn't defined in the first version of the class, none of the old data can be read back in anymore. And there's an even more serious problem: we've just dramatically increased the bandwidth required by remote method calls.
Suppose Bob works in the mailroom. And we serialize the object associated with Bob. In the old version of our application, the data for serialization consisted of:
  • The class information for Employee
  • The instance data for Bob

In the new version, we send:
  • The class information for Employee
  • The instance data for Bob
  • The instance data for Sally (who runs the mailroom and is Bob's manager)
  • The instance data for Henry (who is in charge of building facilities)
  • The instance data for Alison (Director, Corporate Infrastructure)
  • The instance data for Mary (VP in charge of IT)

And so on...
The new version of the application isn't backwards-compatible because our old data can't be read by the new version of the application. In addition, it's slower and is much more likely to cause network congestion.
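One common way to avoid dragging the whole management chain over the wire, sketched below, is to mark the object reference transient and carry a lightweight identifier instead. The managerId field here is purely illustrative; the point is simply that the recursion stops at the transient field.

import java.io.Serializable;

public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;

    // The manager object itself is never written to the stream...
    public transient Employee manager;
    // ...instead we send something small that lets the receiver look the manager up.
    public String managerId;
}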

The Externalizable Interface

To solve the performance problems associated with making a class Serializable, the serialization mechanism allows you to declare that a class is Externalizable instead. When ObjectOutputStream's writeObject( ) method is called, it performs the following sequence of actions:
  1. It tests to see if the object is an instance of Externalizable. If so, it uses externalization to marshall the object.
  2. If the object isn't an instance of Externalizable, it tests to see whether the object is an instance of Serializable. If so, it uses serialization to marshall the object.
  3. If neither of these two cases apply, an exception is thrown.

Externalizable is an interface that consists of two methods:
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;
public void writeExternal(ObjectOutput out) throws IOException;
These have roughly the same role that readObject( ) and writeObject( ) have for serialization. There are, however, some very important differences. The first, and most obvious, is that readExternal( ) and writeExternal( ) are part of the Externalizable interface. An object cannot be declared to be Externalizable without implementing these methods.
However, the major difference lies in how these methods are used. The serialization mechanism always writes out class descriptions of all the serializable superclasses. And it always writes out the information associated with the instance when viewed as an instance of each individual superclass.
Externalization gets rid of some of this. It writes out the identity of the class (which boils down to the name of the class and the appropriate serialVersionUID). It also stores the superclass structure and all the information about the class hierarchy. But instead of visiting each superclass and using that superclass to store some of the state information, it simply calls writeExternal( ) on the local class definition. In a nutshell: it stores all the metadata, but writes out only the local instance information.
TIP:   This is true even if the superclass implements Serializable. The metadata about the class structure will be written to the stream, but the serialization mechanism will not be invoked. This can be useful if, for some reason, you want to avoid using serialization with the superclass. For example, some of the Swing classes,[8] while they claim to implement Serializable, do so incorrectly (and will throw exceptions during the serialization process). If you really need to use these classes, and you think serialization would be useful, you may want to think about creating a subclass and declaring it to be Externalizable. Instances of your class will be written out and read in using externalization. Because the superclass is never serialized or deserialized, the incorrect code is never invoked, and the exceptions are never thrown.

Comparing Externalizable to Serializable

Of course, this efficiency comes at a price. Serializable can frequently be implemented by doing very little: declaring that a class implements the Serializable interface and making sure its first nonserializable superclass has a zero-argument constructor. Furthermore, as an application evolves, the serialization mechanism automatically adapts. Because the metadata is automatically extracted from the class definitions, application programmers often don't have to do anything except recompile the program.
On the other hand, Externalizable isn't particularly easy to do, isn't very flexible, and requires you to rewrite your marshalling and demarshalling code whenever you change your class definitions. However, because it eliminates almost all the reflective calls used by the serialization mechanism and gives you complete control over the marshalling and demarshalling algorithms, it can result in dramatic performance improvements.
To demonstrate this, I have defined the EfficientMoney class. It has the same fields and functionality as Money but implements Externalizable instead of Serializable:
public class EfficientMoney extends ValueObject implements Externalizable {
    public static final long serialVersionUID = 1;
    protected int _cents;

    public EfficientMoney(Integer cents) {
        this(cents.intValue());
    }

    public EfficientMoney(int cents) {
        super(cents + " cents.");
        _cents = cents;
    }

    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        _cents = in.readInt();
        _stringifiedRepresentation = _cents + " cents.";
    }

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(_cents);
    }
}
We now want to compare Money with EfficientMoney. We'll do so using the following application:
public class MoneyWriter {
    public static void main(String[] args) {
        writeOne();
        writeMany();
    }

    private static void writeOne() {
        try {
            System.out.println("Writing one instance");
            Money money = new Money(1000);
            writeObject("C:\\temp\\foo", money);
        }
        catch (Exception e) {}
    }

    private static void writeMany() {
        try {
            System.out.println("Writing many instances");
            ArrayList listOfMoney = new ArrayList();
            for (int i = 0; i < 10000; i++) {
                Money money = new Money(i * 100);
                listOfMoney.add(money);
            }
            writeObject("C:\\temp\\foo2", listOfMoney);
        }
        catch (Exception e) {}
    }

    private static void writeObject(String filename, Object object) throws Exception {
        FileOutputStream fileOutputStream = new FileOutputStream(filename);
        ObjectOutputStream objectOutputStream = new ObjectOutputStream(fileOutputStream);
        long startTime = System.currentTimeMillis();
        objectOutputStream.writeObject(object);
        objectOutputStream.flush();
        objectOutputStream.close();
        System.out.println("Time: " + (System.currentTimeMillis() - startTime));
    }
}
On my home machine, averaging over 10 trial runs for both Money and EfficientMoney, I get the results shown in Table 10-1. [9]
Table 10-1: Testing Money and EfficientMoney

Class            Number of instances    File size    Elapsed time
Money            1                      266 bytes    60 milliseconds
Money            10,000                 309 KB       995 milliseconds
EfficientMoney   1                      199 bytes    50 milliseconds
EfficientMoney   10,000                 130 KB       907 milliseconds
These results are fairly impressive. By simply converting a leaf class in our hierarchy to use externalization, I save 67 bytes and 10 milliseconds when serializing a single instance. In addition, as I pass larger data sets over the wire, I save more and more bandwidth--on average, 18 bytes per instance.
TIP:   Which numbers should we pay attention to? The single-instance costs or the 10,000-instance costs? For most applications, the single-instance cost is the most important one. A typical remote method call involves sending three or four arguments (usually of different types) and getting back a single return value. Since RMI clears the serialization mechanism between calls, a typical remote method call looks a lot more like serializing 3 or 4 single instances than serializing 10,000 instances of the same class.
If I need more efficiency, I can go further and remove ValueObject from the hierarchy entirely. The ReallyEfficientMoney class directly extends Object and implements Externalizable:
public class ReallyEfficientMoney implements Externalizable {
    public static final long serialVersionUID = 1;
    protected int _cents;
    protected String _stringifiedRepresentation;

    public ReallyEfficientMoney(Integer cents) {
        this(cents.intValue());
    }

    public ReallyEfficientMoney(int cents) {
        _cents = cents;
        _stringifiedRepresentation = _cents + " cents.";
    }

    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        _cents = in.readInt();
        _stringifiedRepresentation = _cents + " cents.";
    }

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(_cents);
    }
}
ReallyEfficientMoney has much better performance than either Money or EfficientMoney when a single instance is serialized but is almost identical to EfficientMoney for large data sets. Again, averaging over 10 iterations, I record the numbers in Table 10-2.
Table 10-2: Testing ReallyEfficientMoney

Class                  Number of instances    File size    Elapsed time
ReallyEfficientMoney   1                      74 bytes     20 milliseconds
ReallyEfficientMoney   10,000                 127 KB       927 milliseconds
Compared to Money, this is quite impressive; I've shaved almost 200 bytes of bandwidth and saved 40 milliseconds for the typical remote method call. The downside is that I've had to abandon my object hierarchy completely to do so; a significant percentage of the savings resulted from not including ValueObject in the inheritance chain. Removing superclasses makes code harder to maintain and forces programmers to implement the same method many times (ReallyEfficientMoney can't use ValueObject's implementation of equals( ) and hashCode( ) anymore). But it does lead to significant performance improvements.

One Final Point

An important point is that you can decide whether to implement Externalizable or Serializable on a class-by-class basis. Within the same application, some of your classes can be Serializable, and some can be Externalizable. This makes it easy to evolve your application in response to actual performance data and shifting requirements. The following two-part strategy is often quite nice:
  • Make all your classes implement Serializable.
  • After that, make some of them, the ones you send often and for which serialization is dramatically inefficient, implement Externalizable instead.

This gets you most of the convenience of serialization and lets you use Externalizable to optimize when appropriate.
Experience has shown that, over time, more and more objects will gradually come to directly extend Object and implement Externalizable. But that's fine. It simply means that the code was incrementally improved in response to performance problems when the application was deployed.



1. Just to be clear: doing things this way would be a bad idea (and this is not the way RMI passes instances over the wire).
2. RMI actually uses a subclass of ObjectOutputStream to customize its behavior.
3. The main exception is EJB containers that require earlier versions of Java. At this writing, for example, Oracle 8i's EJB container uses JDK 1.1.6.
4. If you don't know much about threads, just wait a few chapters and then revisit this example. It will make more sense then.
5. Though we'll come close.
6. Serialization is a slow process that uses the reflection API quite heavily in addition to the bandwidth.
7. Recall that RMI resets the serialization mechanism with every method call.
8. JTextArea is one of the most egregious offenders.
9. We need to average because the elapsed time can vary (it depends on what else the computer is doing). The size of the file is, of course, constant.
