R & Weather Data

Overview

The weather has changed in Sacramento and now the daily lows are higher than some of the daily highs during winter.

Can I use R to determine on how many days in the winter of 2012/2013 the daily high temperature was lower than yesterday's low (59 degrees F)?

Results

Although I was unable to get the example RJSONIO code in the first article to work, I was able to sign up for an API key from Weather Underground and pull historical weather data using the Weather Underground example.

I was able to get the code in the second article to work, and no API Key was needed.

Article on Importing Weather Data into R

  • Title: Getting Historical Weather Data in R and SAP HANA
  • Posted by: Jitender Aswani
  • allthingsr on blogspot
    • weather underground api link
      • I signed up for a weather underground api key
        • [X] sign up for account
        • [X] select a plan for the api key
          • I selected the free “stratus” plan
    • [X] try code from article
      • [X] install.packages("RJSONIO")
        • [X] library("RJSONIO")
      • [X] install.packages("rjson")
        • may not be needed
      • [X] tried code, but it failed due to an invalid API key
        • [X] tried weather underground example
  • [X] Give up for now 2013.04.30 07:39
  • [ ] modify code for Sacramento (SAC)

Another Article on Importing Weather Data into R

Functions

wunder_station_daily <- function(station, date)
  {
  base_url <- 'http://www.wunderground.com/weatherstation/WXDailyHistory.asp?'

                                        # parse date
  m <- as.integer(format(date, '%m'))
  d <- as.integer(format(date, '%d'))
  y <- format(date, '%Y')

                                        # compose final url
  final_url <- paste(base_url,
  'ID=', station,
  '&month=', m,
  '&day=', d,
  '&year=', y,
  '&format=1', sep='')

                                        # reading in as raw lines from the web server
                                        # contains <br> tags on every other line
  u <- url(final_url)
  the_data <- readLines(u)
  close(u)

                                        # only keep records with more than 5 rows of data
  if(length(the_data) > 5 )
        {
                                        # remove the first and last lines
        the_data <- the_data[-c(1, length(the_data))]

                                        # remove odd numbers starting from 3 --> end
        the_data <- the_data[-seq(3, length(the_data), by=2)]

                                        # extract header and cleanup
        the_header <- the_data[1]
        the_header <- make.names(strsplit(the_header, ',')[[1]])

                                        # convert to CSV, without header
        tC <- textConnection(paste(the_data, collapse='\n'))
        the_data <- read.csv(tC, as.is=TRUE, row.names=NULL, header=FALSE, skip=1)
        close(tC)

                                        # remove the last column, created by trailing comma
        the_data <- the_data[, -ncol(the_data)]

                                        # assign column names
        names(the_data) <- the_header

                                        # convert Time column into properly encoded date time
        the_data$Time <- as.POSIXct(strptime(the_data$Time, format='%Y-%m-%d %H:%M:%S'))

                                        # remove UTC and software type columns
        the_data$DateUTC.br. <- NULL
        the_data$SoftwareType <- NULL

                                        # sort and fix rownames
        the_data <- the_data[order(the_data$Time), ]
        row.names(the_data) <- 1:nrow(the_data)

                                        # done
        return(the_data)
        }
  }

Pull Data

                                        # be sure to load the function from above first
                                        # get a single day's worth of (hourly) data
w <- wunder_station_daily('KCAANGEL4', as.Date('2011-05-05'))

                                        # get data for a range of dates
library(plyr)
date.range <- seq.Date(from=as.Date('2009-01-01'), to=as.Date('2011-05-06'), by='1 day')

                                        # pre-allocate list
l <- vector(mode='list', length=length(date.range))

                                        # loop over dates, and fetch data
for(i in seq_along(date.range))
  {
  print(date.range[i])
  l[[i]] <- wunder_station_daily('KCAANGEL4', date.range[i])
  }

                                        # stack elements of list into DF, 
                                        # filling missing columns with NA
d <- ldply(l)

                                        # save to CSV
write.csv(d, file=gzfile('KCAANGEL4.csv.gz'), row.names=FALSE)

Results

  • Worked fine, and no API Key required [2013.04.30 08:03]
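
With the observations stacked into one data frame, the question from the Overview reduces to taking the maximum temperature per day and counting the days below the threshold. The following is a minimal sketch, not code from either article. It assumes the data frame has a POSIXct `Time` column and a numeric `TemperatureF` column, and it assumes "winter" means 1 December 2012 through 28 February 2013. The sample data and the function name `count_cold_days` are made up for illustration.

```r
# Synthetic stand-in for the stacked data frame `d` built above.
# ASSUMPTION: columns `Time` (POSIXct) and `TemperatureF` (numeric).
obs <- data.frame(
  Time = as.POSIXct(c('2012-12-21 04:00:00', '2012-12-21 15:00:00',
                      '2013-01-10 05:00:00', '2013-01-10 14:00:00',
                      '2013-02-15 06:00:00', '2013-02-15 15:00:00',
                      '2013-04-01 15:00:00'), tz = 'UTC'),
  TemperatureF = c(38, 52, 41, 61, 45, 58.5, 75)
)

# Count the days in [from, to] whose highest reading is below `threshold`.
# `count_cold_days` is a hypothetical helper, not part of any package.
count_cold_days <- function(obs, from, to, threshold) {
  # drop missing readings so max() is well defined for every day
  obs <- obs[!is.na(obs$TemperatureF), ]

  # calendar day of each observation, in the time zone of the Time column
  day <- as.Date(format(obs$Time, '%Y-%m-%d'))
  in_range <- day >= from & day <= to

  # daily high = maximum reading per calendar day
  daily_high <- tapply(obs$TemperatureF[in_range],
                       format(day[in_range]), max)

  sum(daily_high < threshold)
}

# ASSUMPTION: winter 2012/2013 taken as 2012-12-01 through 2013-02-28
cold_days <- count_cold_days(obs,
                             from = as.Date('2012-12-01'),
                             to = as.Date('2013-02-28'),
                             threshold = 59)
print(cold_days)
```

To run this against real data, fetch the winter date range for a Sacramento station with `wunder_station_daily()` and pass the stacked result in place of `obs`.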

Other Historical Weather Data Links

Logfile Analysis with R

At the organization where I work there is talk about using Splunk for logfile analysis. I suspect we could achieve the same results with R or an open source alternative.
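
As a rough illustration of the kind of analysis base R can do, the sketch below parses a few web server log lines with a regular expression and tabulates the HTTP status codes. The log lines are invented, the common-log-style layout is an assumption, and real logfiles would need a pattern matched to their actual format.

```r
# Invented sample lines in a common-log-style layout (ASSUMPTION).
log_lines <- c(
  '192.0.2.10 - - [30/Apr/2013:07:39:01 -0700] "GET /index.html HTTP/1.1" 200 5120',
  '192.0.2.11 - - [30/Apr/2013:07:39:05 -0700] "GET /missing HTTP/1.1" 404 312',
  '192.0.2.10 - - [30/Apr/2013:07:40:12 -0700] "POST /login HTTP/1.1" 200 128',
  'not a log line'
)

# capture groups: host, timestamp, method, path, status, bytes
pattern <- '^(\\S+) \\S+ \\S+ \\[([^]]+)\\] "(\\S+) (\\S+) [^"]*" ([0-9]{3}) (\\S+)$'

# regexec returns the full match plus one entry per capture group
m <- regmatches(log_lines, regexec(pattern, log_lines, perl = TRUE))

# keep only the lines that matched (full match + 6 groups = 7 entries)
ok <- lengths(m) == 7
parsed <- do.call(rbind, m[ok])

hits <- data.frame(
  host   = parsed[, 2],
  time   = parsed[, 3],
  method = parsed[, 4],
  path   = parsed[, 5],
  status = as.integer(parsed[, 6]),
  stringsAsFactors = FALSE
)

# number of requests per HTTP status code
status_counts <- table(hits$status)
print(status_counts)
```

For a real file, `readLines()` would supply `log_lines`, and the lines where `ok` is FALSE are worth inspecting on their own.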

About R

R Logfile Analysis

Open Source Logfile Analyzers

Proprietary Logfile Analyzers

Data Deduplication

In these days of Big Data, data deduplication is needed more than ever. A common example is contact list deduplication. Yahoo Mail offers this, but it doesn't handle two contacts that are identical except for one field. That additional field may not even be important to you or me.
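
The near-duplicate case described above, two contacts that match on every field except one that doesn't matter, can be sketched in base R by comparing rows on the remaining columns only. This is an illustration under assumptions: the contact data, the column names, and the helper `dedupe_ignoring` are all hypothetical.

```r
# Invented contact list (ASSUMPTION: these column names are illustrative).
contacts <- data.frame(
  name  = c('Ann Lee', 'Ann Lee', 'Bob Ray', 'Bob Ray'),
  email = c('ann@example.com', 'ann@example.com',
            'bob@example.com', 'bob@example.org'),
  note  = c('met at conference', '', 'plumber', 'plumber'),
  stringsAsFactors = FALSE
)

# Keep the first row of each group of rows that agree on every column
# except those listed in `ignore`. Hypothetical helper, not from a package.
dedupe_ignoring <- function(df, ignore) {
  key_cols <- setdiff(names(df), ignore)
  df[!duplicated(df[, key_cols, drop = FALSE]), , drop = FALSE]
}

# the two 'Ann Lee' rows differ only in `note`, so one is dropped;
# the two 'Bob Ray' rows differ in `email`, so both are kept
unique_contacts <- dedupe_ignoring(contacts, ignore = 'note')
print(unique_contacts)
```

This keeps whichever duplicate appears first. Merging the ignored field across duplicates, rather than discarding it, would take more work.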

I still haven’t found a contact deduplicator that I like, though bbdb does a pretty good job. Before I write something, I’m doing a survey of what is out there in the open source world, so here are some links: