# Read Data using RDD API

The dataset contains all HTTP requests made to the NASA Kennedy Space Center WWW server in Florida during August 1995.

The dataset is publicly available at: http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html

It is stored in HDFS as a text file at: `/data/nasa_log_aug95`

In [1]:
logs_rdd = sc.textFile('/data/nasa_log_aug95')
logs_rdd.take(10)

['in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839',
 'uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0',
 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0',
 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0',
 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0',
 'ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:09 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713',
 'uplherc.upl.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0',
 'slppp6.intermind.net - - [01/Aug/1995:00:00:10 -0400] "GET /history/skylab/skylab.html HTTP/1.0" 200 1687',
 'piweba4y.prodigy.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853',
 'slppp6.intermind.net - - [01/Aug/1995:00:00:11 -0400] "GET

# Data Cleaning

The webserver log file is in the __common logfile format__ (https://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format).

Each line is structured as follows:

`remotehost rfc931 authuser [date] "request" status bytes`
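
To make the field layout concrete, here is a minimal plain-Python sketch (no Spark needed) that applies a regular expression with named groups to the first log line shown earlier. The names `CLF_PATTERN` and `parse_fields` are hypothetical and used only for illustration; the group names simply mirror the format line above.

```python
import re

# Hypothetical pattern with named groups, in the same order as the format line:
# remotehost rfc931 authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>.*)" (?P<status>\S+) (?P<bytes>\S+)'
)

def parse_fields(line):
    """Return a dict with the seven fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None

# First line of the dataset, as printed by logs_rdd.take(10)
sample = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
          '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')

fields = parse_fields(sample)
for name, value in fields.items():
    print(f'{name:>10}: {value}')
```

Every value comes back as a string, so `status` and `bytes` still need an explicit `int()` conversion, which is what the next cell takes care of.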

In [2]:
import re

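# One capture group per field: host, rfc931, authuser, [date], "request", status, bytes
p = re.compile(r'([^ ]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "(.*)" ([^ ]*) ([^ ]*)')

def split_logline(line):
    # Assumes every line matches the pattern; a non-matching line would raise AttributeError
    matches = p.match(line)
    host = matches.group(1)
    date = matches.group(4)
    request = matches.group(5)
    status = int(matches.group(6))
    try:
        response_size = int(matches.group(7))
    except ValueError:
        # Fall back to 0 when the bytes field is not a number
        response_size = 0
    return (host, date, request, status, response_size)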

split_rdd = logs_rdd.map(split_logline)
split_rdd.take(10)

[('in24.inetnebr.com',
  '01/Aug/1995:00:00:01 -0400',
  'GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0',
  200,
  1839),
 ('uplherc.upl.com', '01/Aug/1995:00:00:07 -0400', 'GET / HTTP/1.0', 304, 0),
 ('uplherc.upl.com',
  '01/Aug/1995:00:00:08 -0400',
  'GET /images/ksclogo-medium.gif HTTP/1.0',
  304,
  0),
 ('uplherc.upl.com',
  '01/Aug/1995:00:00:08 -0400',
  'GET /images/MOSAIC-logosmall.gif HTTP/1.0',
  304,
  0),
 ('uplherc.upl.com',
  '01/Aug/1995:00:00:08 -0400',
  'GET /images/USA-logosmall.gif HTTP/1.0',
  304,
  0),
 ('ix-esc-ca2-07.ix.netcom.com',
  '01/Aug/1995:00:00:09 -0400',
  'GET /images/launch-logo.gif HTTP/1.0',
  200,
  1713),
 ('uplherc.upl.com',
  '01/Aug/1995:00:00:10 -0400',
  'GET /images/WORLD-logosmall.gif HTTP/1.0',
  304,
  0),
 ('slppp6.intermind.net',
  '01/Aug/1995:00:00:10 -0400',
  'GET /history/skylab/skylab.html HTTP/1.0',
  200,
  1687),
 ('piweba4y.prodigy.com',
  '01/Aug/1995:00:00:10 -0400',
  'GET /images/launchmedium.gif HTTP/1

## Let's extract the resource path from the request field.

In [3]:
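def split_request(record):
    # record[2] is the request string, e.g. 'GET /images/launch-logo.gif HTTP/1.0'
    request = record[2]
    parts = request.split()
    if len(parts) > 1:
        # The second token is the resource path
        resource = parts[1]
    else:
        # No second token in the request: use an empty path
        resource = ''

    # Append the resource path as a sixth field
    return record + (resource,)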

resource_rdd = split_rdd.map(split_request)
resource_rdd.cache()
resource_rdd.take(10)

[('in24.inetnebr.com',
  '01/Aug/1995:00:00:01 -0400',
  'GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0',
  200,
  1839,
  '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt'),
 ('uplherc.upl.com',
  '01/Aug/1995:00:00:07 -0400',
  'GET / HTTP/1.0',
  304,
  0,
  '/'),
 ('uplherc.upl.com',
  '01/Aug/1995:00:00:08 -0400',
  'GET /images/ksclogo-medium.gif HTTP/1.0',
  304,
  0,
  '/images/ksclogo-medium.gif'),
 ('uplherc.upl.com',
  '01/Aug/1995:00:00:08 -0400',
  'GET /images/MOSAIC-logosmall.gif HTTP/1.0',
  304,
  0,
  '/images/MOSAIC-logosmall.gif'),
 ('uplherc.upl.com',
  '01/Aug/1995:00:00:08 -0400',
  'GET /images/USA-logosmall.gif HTTP/1.0',
  304,
  0,
  '/images/USA-logosmall.gif'),
 ('ix-esc-ca2-07.ix.netcom.com',
  '01/Aug/1995:00:00:09 -0400',
  'GET /images/launch-logo.gif HTTP/1.0',
  200,
  1713,
  '/images/launch-logo.gif'),
 ('uplherc.upl.com',
  '01/Aug/1995:00:00:10 -0400',
  'GET /images/WORLD-logosmall.gif HTTP/1.0',
  304,
  0,
  '/images/WORLD-logosm

# Exploratory Data Analysis (EDA)

## How many requests are there in total?

In [4]:
resource_rdd.count()

1569898

## Print descriptive statistics for the content size

In [5]:
resource_rdd.map(lambda record: record[4]).stats()

(count: 1569898, mean: 17089.225812122706, stdev: 67954.742278517, max: 3421948.0, min: 0.0)
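
For intuition about what these five numbers mean, the following plain-Python sketch computes the same kind of summary on the nine response sizes visible in the first records shown earlier. It assumes that the `stdev` reported by `stats()` is the population standard deviation (dividing by n rather than n - 1), which is my understanding of Spark's `StatCounter`; the `summary` dict is just an illustrative container.

```python
import statistics

# Response sizes of the first nine log lines shown in the raw output earlier
sizes = [1839, 0, 0, 0, 0, 1713, 0, 1687, 11853]

summary = {
    'count': len(sizes),
    'mean': statistics.fmean(sizes),
    'stdev': statistics.pstdev(sizes),  # population standard deviation (divides by n)
    'max': max(sizes),
    'min': min(sizes),
}
print(summary)
```

In the full dataset the standard deviation is roughly four times the mean and the maximum is about 3.4 MB, so the size distribution is heavily right-skewed: most responses are small, and a few large files dominate.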

## How many requests are there for each status code?

* __200 OK__: Standard response for successful HTTP requests.
* __302 Found__: Tells the client to look at (browse to) another URL.
* __304 Not Modified__: Indicates that the resource has not been modified; there is no need to retransmit the resource since the client still has a previously-downloaded copy.
* __400 Bad Request__: The server cannot or will not process the request due to an apparent client error.
* __403 Forbidden__: The request was valid, but the server is refusing action. The user might not have the necessary permissions for a resource, or may need an account of some sort.
* __404 Not Found__: The requested resource could not be found.
* __500 Internal Server Error__: A generic error message, given when an unexpected condition was encountered.
* __501 Not Implemented__: The server either does not recognize the request method, or it lacks the ability to fulfil the request.

In [6]:
status_rdd = resource_rdd.map(lambda record: (record[3], 1)) \
    .reduceByKey(lambda s1, s2: s1 + s2)
status_rdd.sortByKey().collect()

[(200, 1398988),
 (302, 26497),
 (304, 134146),
 (400, 10),
 (403, 171),
 (404, 10056),
 (500, 3),
 (501, 27)]
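
The map/`reduceByKey` pattern used in this cell can be mimicked locally to see what happens to each key. The sketch below is a plain-Python stand-in, not Spark code: `reduce_by_key` is a hypothetical helper, and the input is the status codes of the first nine log lines shown earlier.

```python
# Status codes of the first nine log lines shown in the raw output earlier
statuses = [200, 304, 304, 304, 304, 200, 304, 200, 200]

def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for reduceByKey: merge the values of each key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

pairs = [(status, 1) for status in statuses]            # the map step
counts = reduce_by_key(pairs, lambda s1, s2: s1 + s2)   # the reduce step
print(sorted(counts.items()))                           # like sortByKey().collect()
```

As a sanity check on the real result, the eight counts in the output above sum to 1569898, which matches the total number of requests reported earlier.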

## Which HTML pages have been accessed most frequently?

In [7]:
resource_rdd.filter(lambda record: record[5].endswith('.html')) \
    .map(lambda record: (record[5], 1)) \
    .reduceByKey(lambda r1, r2: r1 + r2) \
    .sortBy(lambda pair: pair[1], ascending=False) \
    .take(10)

[('/ksc.html', 43687),
 ('/shuttle/missions/sts-69/mission-sts-69.html', 24606),
 ('/shuttle/missions/missions.html', 22453),
 ('/software/winvn/winvn.html', 10345),
 ('/history/history.html', 10134),
 ('/history/apollo/apollo.html', 8985),
 ('/shuttle/countdown/liftoff.html', 7865),
 ('/history/apollo/apollo-13/apollo-13.html', 7177),
 ('/shuttle/technology/sts-newsref/stsref-toc.html', 6517),
 ('/shuttle/missions/sts-69/images/images.html', 5264)]