2010-11-25

What are the Long Tail Effect and Streisand Effect? How are they related to Web 2.0? Give real-life examples of how they occur in the Web 2.0 age.




In this article, my focus will be on describing the Long Tail Effect and Streisand Effect on the Web nowadays and how they are related to Web 2.0.

Long Tail Effect

The term Long Tail (or long tail) refers to a phenomenon that Chris Anderson described in a Wired article in 2004. It describes how products usually considered "unpopular" because of their low sales volumes can still make up a significant portion of an online business, because the total volume of all the "unpopular" items together can be large. The term has become popular because of the rising trend, especially on the World Wide Web, of selling a large number of hard-to-find items, each in relatively small quantities. One reason for this popularity is the significant profit that the strategy can bring compared with selling only a few popular items in large quantities. The total sales of this large number of "non-hit" items are what constitute the Long Tail Effect.
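
To see how a large number of slow sellers can add up, here is a rough back-of-the-envelope sketch in Java. It assumes, purely for illustration, that the sales of an item are proportional to 1/rank (a Zipf-like curve). That assumption, the class name LongTailSketch and the catalogue sizes are mine and do not come from Anderson's article.

```java
// Hypothetical illustration: assumes sales fall off as 1/rank (a Zipf-like curve).
final class LongTailSketch {

    /** Share of total sales that comes from items ranked below the top headSize. */
    static double tailShare(int catalogSize, int headSize) {
        double head = 0.0;
        double total = 0.0;
        for (int rank = 1; rank <= catalogSize; rank++) {
            double sales = 1.0 / rank;   // assumed demand for the item at this rank
            total += sales;
            if (rank <= headSize) {
                head += sales;
            }
        }
        return (total - head) / total;
    }

    public static void main(String[] args) {
        // A store that can stock 100,000 titles versus one limited to the top 1,000 hits.
        System.out.printf("Tail share of sales: %.0f%%%n", 100 * tailShare(100_000, 1_000));
    }
}
```

Under that assumption, the items outside the top 1,000 of a 100,000-item catalogue account for roughly 38% of total sales, which is volume that a store limited to the hits never sees.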

The Long Tail Effect only works when the cost of inventory storage and distribution is low enough that it becomes economically viable to sell relatively unpopular products. This can in turn make those unpopular goods more competitive and reduce the demand for the most popular goods. For example, Web content businesses with broad coverage, such as Yahoo! and Google, may be threatened by the rise of smaller Web sites that focus on narrow topics and cover them in more detail than the larger sites do. This is made possible by easy and cheap Web site software and by the spread of RSS, which greatly reduce both the cost of establishing and maintaining a Web site and the effort readers need to find these small sites on the Internet.

The Long Tail Effect is very much alive on the modern World Wide Web, and it is related to Web 2.0 because many of the successful Internet businesses that helped shape Web 2.0 have leveraged the Long Tail as part of their business. Examples among the major companies include eBay (auctions), Amazon (retail), the iTunes Store (music and podcasts), and Yahoo! and Google (web search), along with smaller Internet companies such as Audible (audio books) and Netflix (video rental).

Take Amazon and Netflix, for instance. Since the rise of Web 2.0, both have used the Long Tail Effect to gain a competitive edge over the neighborhood Blockbuster, because stocking exotic and unusual books and DVDs in large centralized warehouses carries a low overhead cost. By contrast, Blockbuster has to spend a lot of money every month on every square foot of its retail stores, which are built in visible, and thus expensive, locations in your neighborhood. These costs are probably multiples of what a larger warehouse in the middle of nowhere costs. Ironically, such heavy spending on storing inventory does not guarantee rising popularity. In fact, Amazon and Netflix are now more popular than Blockbuster because of the convenience of their online services, which let customers buy without even leaving their houses.

Streisand Effect

According to Wikipedia, the Streisand effect is primarily an online phenomenon in which an attempt to censor or remove a piece of information has the unintended consequence of publicizing the information more widely than would have happened if no censorship had been attempted. It is named after the American entertainer Barbra Streisand, following a 2003 incident in which her attempts to suppress photographs of her residence inadvertently generated further publicity.

One of the causes of the Streisand effect is the increased accessibility of information on the Internet, especially with Web 2.0 technologies such as Twitter, Facebook, YouTube, forums and BitTorrent, which make the exchange of information easier, faster and more open to the public than before. Whenever someone tries to suppress certain information, typically via lawsuits, the blogosphere can easily, and often deliberately, spread the news, which usually contains the controversial information, across the Internet or share it with friends. Ultimately, this may do more damage to the complainant's reputation than if he had just let the matter slide.

One famous real-life example of the Streisand effect in the Web 2.0 age is the Edison Chen photo scandal, which shook the Hong Kong entertainment industry in early 2008. The scandal involved the illegal distribution over the Internet of intimate and private photographs of the famous Hong Kong actor Edison Chen with various women, including the actresses Gillian Chung, Cecilia Cheung and Bobo Chan. It started when someone who had accidentally gained possession of these photographs posted them on the Internet to share with his friends. The photographs then spread across the Internet, and the news reached other Internet users as visitors forwarded the images to different forums in Hong Kong.

Edison's effort to suppress the distribution of his private photographs ironically led to more photographs being put on the Internet and to more news coverage of the scandal, making it one of the hottest topics in the history of the Hong Kong entertainment industry. The rapid spread of the news and of the photographs was made possible by technologies of the Web 2.0 age such as forums, which make communication among people much easier and almost real-time, and BitTorrent and Foxy, which make it possible to share photographs between computers on the Internet.

The Long Tail Effect and the Streisand Effect do exist in today's Web 2.0 world, and we have all probably taken part in them at one time or another. As for the Long Tail Effect, we should learn to take advantage of it, especially in small businesses, for example by considering niche marketing to maximize profit. As for the Streisand Effect, although strict copyright protection of online content may help to prevent it, this does not guarantee total safety, because in the end it is up to the judge or jury to decide what the law says and how it applies. We should therefore use our common sense when making speeches or posting online content, explicitly or implicitly, to prevent future confrontations.




2010-11-18

What is StAX? What are its advantages over SAX and DOM?


StAX (Streaming API for XML) is a standard XML API that streams XML data to and from an application, like other streaming APIs such as SAX and XNI. It is the standard pull-parser API for Java. In StAX, the client application asks the parser for the next piece of information, rather than the parser telling the client application when the next datum is available. Put simply, in StAX the client application drives the parser instead of the other way round. Moreover, StAX shares with SAX the ability to read arbitrarily large documents, and it is a bidirectional API that allows applications both to read existing XML documents and to create new ones.

One feature of StAX is that it provides efficient XML access through a cursor API, XMLStreamReader. The cursor moves across an XML document from beginning to end, one item at a time and only forwards. The item the cursor points at can be, for example, a text node, a start-tag, a comment or the beginning of the document. You can retrieve information about the item at the current cursor position by invoking methods such as getName and getText on the XMLStreamReader. The example below shows how an instance of XMLStreamReader is obtained from the XMLInputFactory class in a typical StAX program:

URL u = new URL("http://rom1023.blogspot.com/");
InputStream in = u.openStream();
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = factory.createXMLStreamReader(in);

XMLStreamReader has many getter methods for retrieving different kinds of information about the current item, such as the name of an element, its attribute count or the text of a text node. The sample code below iterates through the XML document and prints the name of each element the cursor encounters and the text of each character event:

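for (int event = parser.next(); event != XMLStreamConstants.END_DOCUMENT; event = parser.next()) {
    switch (event) {
        case XMLStreamConstants.START_ELEMENT:
            // Print the name of each element as its start-tag is reached.
            System.out.println(parser.getLocalName());
            break;
        case XMLStreamConstants.CHARACTERS:
            // Print the text reported by a character event.
            System.out.println(parser.getText());
            break;
        default:
            // Ignore all other event types.
            break;
    }
}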
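
For readers who want to try the fragments above, here is a minimal self-contained sketch that parses a small in-memory document instead of a URL. The class name CursorDemo, the describe method and the sample document are made up for this example. Coalescing is switched on so that each run of text arrives as a single character event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class CursorDemo {

    /** Walks the document once and records each start-tag name and each non-blank text run. */
    static List<String> describe(String xml) throws XMLStreamException {
        List<String> items = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader parser = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (parser.hasNext()) {
                switch (parser.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        items.add("element: " + parser.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (!parser.isWhiteSpace()) {
                            items.add("text: " + parser.getText());
                        }
                        break;
                    default:
                        break; // other event types are ignored in this sketch
                }
            }
        } finally {
            parser.close();
        }
        return items;
    }

    public static void main(String[] args) throws XMLStreamException {
        // The sample document is invented for this example.
        describe("<blog><title>StAX notes</title><post>Hello</post></blog>")
                .forEach(System.out::println);
    }
}
```

Running main prints element: blog, element: title, text: StAX notes, element: post and text: Hello, one per line.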

The loop with a switch statement shown above is a very common pattern in StAX programs, preferred over a stack of if-else statements. However, it is also one of the major criticisms of StAX: integer type codes for the kind of item at the cursor, and the big switch statements that go with them, do not fit the style of object-oriented programs, which are based on classes, inheritance and polymorphism. A more object-oriented design would have the next method return an XMLEvent object with subclasses such as StartElement, Characters and EndDocument. The main reason for using integer type codes instead of classes is performance, since it avoids creating an event object for every item and inspecting its type at run time, but it does sacrifice the advantages of object-oriented programming.
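
For comparison, the javax.xml.stream package also contains an iterator-style API, XMLEventReader, whose nextEvent method returns XMLEvent objects. The sketch below is a rough equivalent of the cursor loop written against that API. The class name EventIteratorDemo and the describe method are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class EventIteratorDemo {

    /** Same output as a cursor loop, but each item arrives as an XMLEvent object. */
    static List<String> describe(String xml) throws XMLStreamException {
        List<String> items = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader events = factory.createXMLEventReader(new StringReader(xml));
        try {
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();
                if (event.isStartElement()) {
                    items.add("element: " + event.asStartElement().getName().getLocalPart());
                } else if (event.isCharacters() && !event.asCharacters().isWhiteSpace()) {
                    items.add("text: " + event.asCharacters().getData());
                }
            }
        } finally {
            events.close();
        }
        return items;
    }
}
```

The integer type codes disappear from the calling code, at the price of one event object per item.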

That simple example alone perhaps does not demonstrate the full power of StAX. As mentioned earlier, StAX is a bidirectional API: it can write data to an XML document as well as read one. For output, instead of the XMLStreamReader introduced earlier, we use XMLStreamWriter, an interface that provides methods to write elements, attributes, comments, text and so on to an XML document. The example below shows how an instance of XMLStreamWriter is obtained from the XMLOutputFactory class:

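OutputStream out = new FileOutputStream("data.xml");
XMLOutputFactory factory = XMLOutputFactory.newInstance();
// Pass the encoding so the bytes written match the encoding named in the XML declaration.
XMLStreamWriter writer = factory.createXMLStreamWriter(out, "ISO-8859-1");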

Different kinds of data can be written to the output stream with the various writeFOO methods of XMLStreamWriter, such as writeStartDocument, writeStartElement, writeEndElement, writeCharacters and writeComment. Below is an example that writes a hello world XML document:

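writer.writeStartDocument("ISO-8859-1", "1.0");
writer.writeStartElement("greeting");
writer.writeCharacters("Hello World");
writer.writeEndDocument();  // also closes the still-open greeting element
writer.flush();             // push any buffered output to the stream
writer.close();             // release the writer
out.close();                // the writer does not close the underlying stream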

There are several advantages to using XMLStreamWriter to write an XML document. One is that it helps to maintain some well-formedness constraints. For example, the writeEndDocument method closes all unclosed start-tags, and writeCharacters performs any necessary escaping of special characters such as < and &. The interface does not promise full well-formedness checking of what it is given, however: an implementation may happily write a document with multiple root elements, undeclared namespace prefixes or element names that contain whitespace, so these remain the programmer's responsibility. Overall, creating an XML document with XMLStreamWriter is generally faster and more memory-efficient than building a DOM tree and serializing it.
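
The closing and escaping behaviour can be checked with a small sketch that writes to an in-memory StringWriter instead of a file. The class name WriterDemo, the greeting method and the sample text are assumptions made for illustration only.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class WriterDemo {

    /** Writes a one-element document around the given text and returns it as a string. */
    static String greeting(String text) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument("1.0");
        writer.writeStartElement("greeting");
        writer.writeCharacters(text);   // escapes characters such as < and &
        writer.writeEndDocument();      // closes the still-open greeting element
        writer.flush();
        writer.close();                 // does not close the underlying StringWriter
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(greeting("Fish & <chips"));
    }
}
```

In the result, the element content appears as Fish &amp; &lt;chips and the closing greeting tag is present even though writeEndElement was never called.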

Having introduced StAX, the big question is: what are its advantages over SAX and DOM? Let us start with SAX. One of the major differences between the two APIs is that StAX is a streaming pull parser whilst SAX is a streaming push parser. Streaming pull parsing is a programming model in which the client application takes the initiative and calls methods on an XML parsing library when it needs to interact with an XML document. That is, the client application only gets (pulls) XML data when it explicitly asks for it. Streaming push parsing, on the other hand, is a programming model in which the XML parser sends (pushes) XML data to the client application as the parser moves through the document from one element to the next. That is, the parser sends the data whether or not the client application is ready to use it.
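
To make the difference in control flow concrete, the sketch below counts elements with a given name twice: once with a SAX handler that the parser calls back, and once with a StAX loop that the client drives. The class name PushVersusPull and both method names are hypothetical. The SAX and StAX calls themselves are the standard JAXP ones.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class PushVersusPull {

    /** Push: the SAX parser owns the loop and calls the handler for every start-tag. */
    static int countWithSax(String xml, String name)
            throws ParserConfigurationException, SAXException, IOException {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    /** Pull: the client owns the loop and asks the parser for the next event. */
    static int countWithStax(String xml, String name) throws XMLStreamException {
        XMLStreamReader parser =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (parser.hasNext()) {
            if (parser.next() == XMLStreamConstants.START_ELEMENT
                    && parser.getLocalName().equals(name)) {
                count++;
            }
        }
        parser.close();
        return count;
    }
}
```

Both methods return the same count. The difference is who calls whom: in the SAX version the counting logic lives in a callback, while in the StAX version it sits in an ordinary loop that the client could leave at any point.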

The advantages of StAX as a streaming pull parser over SAX as a streaming push parser are summarized below:
  • A StAX parser is more flexible because the client controls the application thread and calls methods on the parser when needed. By contrast, a SAX parser takes control of the application thread, and the client can only accept invocations from the parser.
  • Pull parsing libraries can be much smaller than push parsing libraries, and the client code that uses them is usually simpler and easier to write, even for more complex XML documents.
  • A StAX pull parser can read multiple documents at one time with a single thread, which cannot be done in the same way with a SAX parser.
  • A StAX pull parser can filter an XML document so that elements the client does not need are ignored, and it can support XML views of non-XML data. A SAX push parser gives the client less direct control over this.
  • Unlike SAX, StAX is a bidirectional API which allows programs both to read existing XML documents and to create new ones. This gives StAX an edge over SAX by offering the user more functions and alternatives.
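
The point in the list above about reading several documents on a single thread can be sketched as follows. Two readers are advanced alternately, one element at a time, by the same calling thread. The class name LockstepDemo and its two methods are made-up names for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class LockstepDemo {

    /** Advances the reader to its next start-tag; returns the tag name, or null at the end. */
    static String nextElement(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                return reader.getLocalName();
            }
        }
        return null;
    }

    /** Reads two documents alternately on the calling thread, one element at a time. */
    static List<String> interleave(String xmlA, String xmlB) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader a = factory.createXMLStreamReader(new StringReader(xmlA));
        XMLStreamReader b = factory.createXMLStreamReader(new StringReader(xmlB));
        List<String> seen = new ArrayList<>();
        String fromA = nextElement(a);
        String fromB = nextElement(b);
        while (fromA != null || fromB != null) {
            if (fromA != null) {
                seen.add("A:" + fromA);
                fromA = nextElement(a);
            }
            if (fromB != null) {
                seen.add("B:" + fromB);
                fromB = nextElement(b);
            }
        }
        a.close();
        b.close();
        return seen;
    }
}
```

With a push parser, each parse call runs to the end of its document before returning, so this kind of interleaving would need a second thread or some buffering.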

Next, let's look at the advantages of StAX over DOM. Generally speaking, there are two programming models for working with XML documents: document streaming (SAX and StAX) and the document model (DOM). Streaming models are particularly useful when the application has limited memory or has to process several requests simultaneously. In fact, it can be argued that the majority of XML business logic benefits more from the streaming style than from the DOM-tree style, which requires the entire DOM tree to be kept in memory.

To summarize, here are the advantages of StAX as a document streaming model over DOM, which uses the document tree model:
  • StAX works better than DOM when processing a large XML document (more than a few megabytes in size) or in memory-constrained environments such as J2ME.
  • The StAX API is generally faster than the DOM API because it can start generating output from the input almost immediately, without waiting for the entire document to be read. DOM, by contrast, has to build a complete and comparatively heavyweight tree data structure while reading the document.
  • The StAX API suits applications that consume a continuous stream of XML to retrieve real-time data, such as Web Services or Instant Messaging applications. The DOM API is unsuitable for these applications: it would have to wait for the stream’s closing tag to finish building the DOM tree, and that tag does not arrive while the document is still streaming.
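
The last point in the list above can be tried out with a document whose closing tag has not arrived yet. In the sketch below the input is an in-memory string that stands in for an open stream. The element names stream and msg, the class name OpenStreamDemo and both methods are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class OpenStreamDemo {

    /** Pulls msg elements off the stream and stops as soon as it has the wanted number. */
    static List<String> firstMessages(String xml, int wanted) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> messages = new ArrayList<>();
        while (messages.size() < wanted && reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals("msg")) {
                messages.add(reader.getElementText());
            }
        }
        reader.close();
        return messages;
    }

    /** Returns true if a DOM tree could be built from the input, false if parsing failed. */
    static boolean domCanLoad(String xml) throws IOException, ParserConfigurationException {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document is incomplete, so no tree can be built
        }
    }
}
```

For the input `<stream><msg>hello</msg><msg>world</msg>`, which never closes its root element, firstMessages(input, 2) returns both messages, whereas domCanLoad(input) returns false because the DOM parser needs the whole document before it can hand anything back.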

In conclusion, StAX is a fast, straightforward and memory-thrifty way of loading data from an XML document. It still has shortcomings: it does not support random access to the document after loading, and it does not work well when the structure of the document is very complex. Even so, many of the toughest XML processing problems encountered today come from exactly the domain where StAX works well compared with SAX and DOM.



2010-11-17

What is "session hijacking"? What are its security threats? How can web developers avoid it?



Session hijacking is the act of exploiting a valid user session to gain unauthorized access to information or services after obtaining or generating a valid session ID. The session ID and other information unique to a user's Web application session are usually saved in HTTP cookies on the user's computer, and an attacker can easily steal these cookies, for example by using an intermediary computer, while the session is still in progress. This makes it risky for applications built on stateless HTTP to keep, in this way, the session parameters that are relevant to the user once the user has logged in. Before tackling the problem of session hijacking, we should first understand how sessions are used in a Web application, how session hijacking can be carried out and what its security threats are. These are discussed in the following paragraphs.

A session is a succession of interactions between two communication end points (usually the client's computer and the Web server) that occurs during the span of a single connection. When a user logs into an application and passes authentication, a session is created on the server to maintain state for later requests from the same user. Applications use sessions to store parameters that are specific to the user for a particular login session. The session is kept "alive" on the server as long as the user is logged on, and it is destroyed only when the user logs out or when it times out after a predefined period of inactivity. Once the session is destroyed, the application parameters for the user are deleted from the allocated memory.

Sessions are independent of each other and are uniquely identified by a session ID, usually a long, random, alphanumeric string that is transmitted between the client's computer and the server. Session IDs are commonly stored in HTTP cookies, in URLs or in hidden fields of HTML Web pages. A URL containing a session ID usually looks something like this:

http://rom1023.blogspot.com/view.do?sessionID=7AD30725112120802

In an HTML Web page, a session ID may be stored as a hidden field of a form:

<input type="hidden" name="sessionID" value="7AD30725112120802">

Session IDs have their problems. If the algorithm used to generate session IDs is based on easily predictable variables such as the time or the IP address, an attacker may be able to guess a valid ID. If encryption (typically SSL) is not used when the session ID is transmitted, the ID may be stolen in transit. Either weakness can lead to session hijacking.
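
As a contrast to IDs derived from the time or an IP address, here is a minimal sketch of generating an unpredictable session ID with java.security.SecureRandom. The class name SessionIds, the method name and the choice of 16 random bytes (128 bits) are assumptions made for this example, not requirements stated anywhere above.

```java
import java.security.SecureRandom;

final class SessionIds {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns a 32-character hexadecimal identifier built from 16 random bytes. */
    static String newSessionId() {
        byte[] bytes = new byte[16];
        RANDOM.nextBytes(bytes);            // cryptographically strong random bytes
        StringBuilder id = new StringBuilder(32);
        for (byte b : bytes) {
            id.append(String.format("%02x", b));
        }
        return id.toString();
    }

    public static void main(String[] args) {
        System.out.println(newSessionId());
    }
}
```

An ID like this still has to be protected in transit, so it complements encryption rather than replacing it.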

There are many methods for hijacking sessions:
  • Session fixation – the attacker tricks the user into using a session ID that the attacker already knows, for example by emailing the user a link that contains this session ID, and then waits for the user to log in under that session.
  • Session sidejacking – the attacker uses packet sniffing to read the network traffic between two parties and steal the session cookie, especially when the web site does not use encryption to prevent attackers from viewing the session data.
  • Cross-site scripting – the attacker tricks the user's computer into executing malicious script or code, which is treated as trustworthy because it appears to belong to the server, and which sends the user's private information, including cookies, to the attacker.

The above are only some of the many methods that can be used to hijack sessions. Unfortunately, they show how easily a session can be hijacked, even by an attacker without much skill. It is therefore important to examine the security threats of session hijacking and to develop preventive measures against such attacks.

One security threat of session hijacking is that the attacker may easily gain complete access to the private data of the user whose session has been hijacked and can perform operations on the user's behalf, which may be extremely harmful, especially if the system deals with money. The hijacked user may suffer financial loss if the attacker is able to perform operations such as money transfers or uses the stolen private data for other illegal purposes. Also, the company that runs the hijacked system may be held accountable for the security breach and required to pay a large sum in compensation.

Moreover, the system may suffer a denial-of-service attack if the attacker gains access to the system and starts bombarding the server with requests to consume all available system resources, or passes malformed input data that crashes an application process. Elevation of privilege may also occur when the attacker assumes the identity of a privileged user to gain access to a highly privileged and trusted process or account. All of these threats must therefore be prevented in order to provide a secure and trustworthy application for users.

Web developers can follow many methods to prevent session hijacking. Some of them are summarized below:
  • Use a long, random and unpredictable alphanumeric string as the session ID, as this reduces the risk that an attacker can guess a valid session ID by trial and error or by a brute-force attack.
  • Regenerate the session ID after a successful login to prevent session fixation, since the attacker will then not know the session ID the user has after logging in.
  • Encrypt important data such as the session ID as it passes over the network between the client's computer and the server, using a technology such as SSL, to prevent sniffing attacks.
  • Make a secondary check of the user's identity on each request to the server. The check can compare the user's IP address on the current request with that of the previous request and stop the user from browsing if the two do not match. The drawback of this method is that it does not prevent attacks by somebody who shares the same IP address.
  • Change the value in the HTTP cookie with each and every request to make it harder for attackers to reuse session data obtained from the cookie. However, this may not suit applications that depend on session parameters passed between pages.
  • Expire the session as soon as the user logs out and clear any session parameters that may have been stored in the HTTP cookies.
  • Set a timeout and shorten the life span of a session or a cookie, so that outdated session data which may be of interest to attackers does not accumulate in the HTTP cookies.
  • Use third-party software such as ArpON, which protects against ARP spoofing, to guard against possible hijacking on the network.
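
Several of the measures in the list above (random IDs, regenerating the ID at login, the IP check, timeouts and expiry at logout) can be sketched together in a small in-memory session store. This is only an illustration under simplifying assumptions: the class SessionStore and all of its methods are hypothetical, it is not thread-safe, and the current time is passed in explicitly so that the behaviour is easy to follow. A real application would normally rely on the session support of its Web framework.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

final class SessionStore {

    private static final class Session {
        String user;              // null until login succeeds
        final String clientIp;    // address the session was created from
        long lastSeenMillis;

        Session(String clientIp, long now) {
            this.clientIp = clientIp;
            this.lastSeenMillis = now;
        }
    }

    private final Map<String, Session> sessions = new HashMap<>();
    private final SecureRandom random = new SecureRandom();
    private final long timeoutMillis;

    SessionStore(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    private String newId() {
        byte[] bytes = new byte[16];
        random.nextBytes(bytes);
        StringBuilder id = new StringBuilder(32);
        for (byte b : bytes) {
            id.append(String.format("%02x", b));
        }
        return id.toString();
    }

    /** Starts an anonymous session for a visitor. */
    String create(String clientIp, long now) {
        String id = newId();
        sessions.put(id, new Session(clientIp, now));
        return id;
    }

    /** After a successful login: drop the old ID and continue under a fresh one. */
    String login(String oldId, String user, long now) {
        Session session = sessions.remove(oldId);
        if (session == null) {
            throw new IllegalArgumentException("unknown session");
        }
        session.user = user;
        session.lastSeenMillis = now;
        String id = newId();
        sessions.put(id, session);
        return id;
    }

    /** Returns the logged-in user, or null if the ID is unknown, expired or used from another IP. */
    String userFor(String id, String clientIp, long now) {
        Session session = sessions.get(id);
        if (session == null) {
            return null;
        }
        if (now - session.lastSeenMillis > timeoutMillis || !session.clientIp.equals(clientIp)) {
            sessions.remove(id);   // expire or reject, and forget the session
            return null;
        }
        session.lastSeenMillis = now;
        return session.user;
    }

    /** Expires the session immediately when the user logs out. */
    void logout(String id) {
        sessions.remove(id);
    }
}
```

Because login discards the pre-login ID, a link containing an ID that the attacker planted before login stops working once the user has authenticated.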

All in all, although the methods above cannot guard against every form of session hijacking, they give Web developers some guidelines for producing a more secure system and reducing the security threats that session hijacking poses. Prevention is better than cure, and there is no harm in following these methods, especially if they do not require any dramatic change to your system. At least you will not be caught unprepared if a session hijacking attempt does occur.

