PERL

Q1: Consider the following program:

-- code --
#!/usr/local/bin/perl5.003 -w
{
    my($string) = "brad<hello>3hello";
    $string =~ /^[^\d]{2,4}<([^>]+)>\d?\1$/;
    if( defined($1) ) {
        print "$1\n";
    } else {
        print "not found\n";
    }
}
-- end code --

Explain what the regular expression is trying to match.

A1: The regular expression attempts to match between 2 and 4 non-digit characters, then anything between a pair of angle brackets, then an optional digit, and finally the captured word from between the angle brackets again. Thus the following inputs would produce these outputs:

brad<foo>3bar  = not found
brad<foo>3foo  = foo
brad<foo>foo   = foo
brd<foo>4foo   = foo
brd<foo>44foo  = not found
br<4foo>44foo  = 4foo

Q2: If you were writing a Perl program as a prototype for a program you would like eventually to write in C (not C++), would this alter how you write the Perl prototype? In what ways?

A2: To do this I would write the program in a strictly non-object-oriented style, making sure that I explicitly put in 'destructors' where necessary (by setting a variable to undef) even though Perl does not require them. I would try to keep types as strict as possible, and where they were converted (the number 2 to the string "2", for example) this would be explicitly documented. Perl idiosyncrasies and special features such as foreach loops and the unless statement would only cause complications when porting from Perl to C. Things like regular expressions and CPAN modules should be avoided unless a satisfactory equivalent is available in C through a similar interface.

Q3: What does 'my' do? Is it the same as 'local'?

A3: To quote from the Perl documentation:

`local($x)' saves away the old value of the global variable `$x', and assigns a new value for the duration of the subroutine, which is visible in other functions called from that subroutine. This is done at run-time, so is called dynamic scoping. local() always affects global variables, also called package variables or dynamic variables.
`my($x)' creates a new variable that is only visible in the current subroutine. This is done at compile-time, so is called lexical or static scoping. my() always affects private variables, also called lexical variables or (improperly) static(ly scoped) variables.

What this means is that with local, variables are propagated to successive function calls. So ...

-- code --
$variable = 'foo';

sub dynamic {
    local $variable = 'bar';
    print_variable();
}

sub static {
    my $variable = 'bar';
    print_variable();
}

sub print_variable {
    print "variable is $variable\n";
}

print_variable();
static();
dynamic();
-- end code --

prints out

foo
foo
bar

Q4: Given a string $text containing multiple lines of text, how do you strip all the HTML tags?

A4: The very short answer is:

$text =~ s/<.*?>//gs;

a longer answer, taken from the perldoc, is

$text =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

however, for various reasons (CDATA sections, Javascript, etc.) this won't work in all cases. A better solution would be to use the CPAN module HTML::Parser, which has years of experience and bug fixing behind it. There's no point reinventing the wheel when the time could be spent more productively implementing other features. Using this, the way to do it would be:

my $p = HTML::Parser->new(api_version => 3,
                          text_h => [ sub { print shift }, "dtext" ]);
$p->parse_file($file);

Q5: Given a string $text, write a regular expression to strip the white space from the beginning and end of the string

A5:

$text =~ s/^\s*(.*?)\s*$/$1/s;  # the /s at the end is in case there are
                                # newlines in the string

another method might be

$text =~ s/(^\s*|\s*$)//sg;

------------------------------------------------
C / C++

Q6: What is the significance of a const variable?

A6: It is set at declaration and cannot subsequently be changed.

const int FOO = 42;  /* by convention constants are named in UPPER CASE */
FOO = 13;            /* Error !
*/

const came into C with the ANSI (C89) standard, so on very old compilers it may be better to define constants using CPP macros such as

#define FOO 42

Q7: A structure has an integer, a pointer and a char, how big is it?

A7: This is a lot trickier than it looks. The following program shows three ways of measuring it.

-- code --
#include <stdio.h>

struct foo {
    int i;
    char c;
    char * p;
} __attribute__ ((packed)) foo_bar = { 1, 'c', "for want of something else" };

typedef struct foo foo_t;

int main (void)
{
    printf("%d == %d == %d\n",
           (int) sizeof(foo_bar),
           (int) sizeof(foo_t),
           (int) (sizeof(int) + sizeof(char) + sizeof(void *)));
    return 0;
}
-- end code --

The reason why it is tricky is that, without the __attribute__ ((packed)) declaration, the three answers will not come out the same. This is because there is a run-time penalty for accessing 32-bit ints which aren't aligned on a dword boundary, so GCC pads the structure for speed; the attribute prevents this. Declaring the whole struct as packed won't work in C++ - instead each member of the struct has to be declared packed. A similar effect, without the added syntax, can be achieved by using the command-line option -fpack-struct on GCC > 2.7.0 (although there were bugs in 2.95.1 and 2.95.2 that affected this).

Q8: What's wrong with the following code?

char a[256];
unsigned char x;
for( x = 0; x < 256; x++ )
    a[x] = x;

A8: An unsigned char can only hold the values 0 to 255, so after 255 x wraps back round to 0 and the condition x < 256 is always true: the loop never terminates. The loop index should be a plain int instead.

Q9: Write a program that reverses the order of the words in a string.

A9:

-- code --
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Reverse the order of words in a string */
int main (int argc, char ** argv)
{
    /* The input and the result */
    char * string, * result;

    /* tmp variable */
    char * tmp;

    /* If we've been given an argument, then use that. Otherwise use the
     * default. We have to strdup it otherwise newer compilers
     * automatically make it const.
     * This could be fixed by using the -fwritable-strings flag. */
    if (argc > 1)
        string = strdup(argv[1]);
    else
        string = strdup("The cow jumped over the moon");

    /* set up a buffer large enough to accommodate the result
     * (same length as the input, plus a trailing space and the nul) */
    if (!(result = (char *) malloc (strlen(string) + 2))) {
        fprintf(stderr, "Failed to malloc buffer\n");
        return -1;
    }

    /* and set the null character so strcat works */
    *result = '\0';

    /*
     * go backwards through the string getting everything from the end
     * to the last space
     */
    while ((tmp = strrchr(string, ' '))) {
        /* concatenate the last word of the sentence onto the result,
         * followed by a space. On the first run result becomes "moon " */
        strcat(result, tmp + 1);
        strcat(result, " ");

        /* set the end of string to be where the last space was. On the
         * first run, string should now be "The cow jumped over" */
        *tmp = '\0';
    }

    /* Now stick the last bit of string (in this case "The") on the end */
    strcat(result, string);

    /* print out the result */
    printf ("%s\n", result);
    return 0;
}
-- end code --

Q10: What is a static member of a class (also called a class variable)? What can it be used for and how do you define it?

A10: A static member is a variable that is part of a class but not part of any object of that class. It is therefore like a global variable confined to the namespace of the class. There is only one copy of a static member, rather than one copy per object. This means that it is good for saving space, useful for defaults and very handy for implementing patterns like singletons.

To define one you use the syntax

class Foo {
public:
    static int bar;
    Foo ();
};

int Foo::bar = 0;  /* the definition goes in one source file */

To retrieve or set a public static member use the syntax

Foo::bar = 42;

Static members can also be manipulated through static methods, which can be called without an instance:

static void set_bar(int num);  /* declared inside the class */
Foo::set_bar(42);

Q11: How do you use exception handling in C++? Briefly explain in an example how you define exception handling and what statements are used.
A11: Exception handling in C++ is done using the syntax

try {
    // do something
} catch (const Error& e) {
    // handle the error
}

Essentially exceptions are a way of propagating errors up the call chain without having to pass around references to error variables. The exception handler can rethrow the exception if it decides that it cannot handle it. So, for example:

class Error {
    const char * err;
public:
    Error(const char * err_) : err(err_) {}
    virtual void print() const { cerr << err; }
};

void foo(int n)
{
    if (n > 9)
        throw Error("too large");
}

try {
    foo (11);
} catch (const Error& e) {
    e.print();
}

would print the string "too large" to stderr.

Q12: Write a class template for a stack of elements of arbitrary type. Include the functions "push", "pop", "size" and the constructor as well as the destructor.

A12:

-- code --
// stack.h
#include <iostream>
using namespace std;

template <class T> class Stack {
    struct Link {
        Link * previous;
        Link * next;
        T value;
        Link (Link * p, Link * n, const T& v)
            : previous(p), next(n), value(v) {}
    };

    Link * head;
    Link * tail;

public:
    Stack();
    ~Stack();
    void push(const T&);
    T pop();
    int size();
    void printAll();
};

// constructor - the stack starts out empty
template <class T> Stack<T>::Stack()
{
    head = 0;
    tail = 0;
}

// destructor
template <class T> Stack<T>::~Stack()
{
    // get the first element
    Link * l = head;

    // iterate through the list
    while (l) {
        // store temporarily
        Link * tmp = l;
        // get the next item
        l = tmp->next;
        // delete this one
        delete tmp;
    }

    // set the list to null
    head = 0;
    tail = 0;
}

// push an element onto the stack
template <class T> void Stack<T>::push(const T& value)
{
    // if the list is empty then ...
    if (!head) {
        // create a new head
        head = new Link(0, 0, value);
        // and set the tail to be the head
        tail = head;
        return;
    }

    // otherwise create a new element and link it to the tail
    tail->next = new Link (tail, 0, value);
    // and then set the tail to be the last element
    tail = tail->next;
}

// pop an element off the stack
template <class T> T Stack<T>::pop()
{
    // if there's nothing on the stack then ...
    if (!tail)
        // ... return nothing.
        // This should probably throw an exception instead,
        // as a default value could give confusing results
        // if T is int.
        return T();

    // ... otherwise get the last element
    Link * tmp = tail;

    // set the tail to be the element before it
    tail = tmp->previous;

    // and chop off the end of the list; if the stack is now
    // empty the head must be cleared too (and dereferencing a
    // null tail avoided)
    if (tail)
        tail->next = 0;
    else
        head = 0;

    // save the value, free the link, and return the value
    T value = tmp->value;
    delete tmp;
    return value;
}

// get the size of the stack
template <class T> int Stack<T>::size()
{
    // initialise the counter
    int i = 0;

    // get the first element
    Link * l = head;

    // iterate through
    while (l) {
        // increase the count
        i++;
        // get the next element
        l = l->next;
    }

    // return the count
    return i;
}

// print every element in the list
template <class T> void Stack<T>::printAll()
{
    // the same traversal as the size method,
    // but printing each value rather than
    // just counting
    int i = 0;
    Link * l = head;
    while (l) {
        i++;
        cout << i << ": " << l->value << endl;
        l = l->next;
    }
}
-- end code --

------------------------------------------------
WEB

Q13: What's the difference between the GET and POST methods?

A13: The GET method passes CGI parameters by way of the query string. As such, certain characters must be escaped (a space, for example, is escaped to %20). According to the HTTP specification the GET method is idempotent: the side effects of several identical GET requests are the same as those of one. What this means in practice is that browsers and proxies may cache the response to a GET request, so it isn't an ideal method if you want to log or store the results of every request.

POST data is passed to the CGI program via the STDIN file handle (rather than via the QUERY_STRING environment variable as with GET). Because of this it is hidden from the user and is much more suitable for things like passing large values (such as files or many, many variables). It is not limited in the size of data you can send, which some GET implementations are. One advantage of GET requests is that it is possible to save the current state of the request as a single URL, which isn't possible with POST.
Q14: I want to provide a personalised web page, which presents different data to different users - what URL/CGI techniques can I use to identify each user? What are the advantages and disadvantages of each method?

A14: IP address - by assuming each user has a unique IP address, you can track a user through the site and store customisations and other data between and during sessions. However, there is no guarantee that a user always comes from the same IP address, or that an IP address belongs to only one user (proxies and dial-up pools break both assumptions).

SessionID - by appending a unique session ID (perhaps generated by MD5-hashing some data such as the user's IP address, the time, the process number and a random number) to requests, the user can be tracked. This can either be done via GET requests, which produces ugly URLs, or via POST requests, which means every single link must be a form, or by some combination. An alternative would be to set the session ID in a cookie, but not all browsers support cookies.

Username and password - this works in combination with one of the previous techniques (usually the session ID): a user logs in and a new session ID is generated. This session ID should expire after a while, to prevent users mistakenly handing out URLs which automatically allow other people to log in as them.

In practice there is no foolproof way of providing a personalised web page. The general technique is to request that the user logs in; the HTTP server then sets a cookie (if possible, and if desired by the user) with their username and password, or some sort of authentication token, in it so that they are automatically logged in when they return (or jump straight into the middle of the site). Then they are tracked using session IDs, either in cookies where possible or via CGI parameters in hidden form fields or encoded in URLs. The session ID usually expires based on other parameters, such as time, access from another IP address, or HTTP Referer headers.
By doing this, almost all surfers will be able to have a personalised experience even if they refuse cookies.

Q15: How can a document specify that it should not be cached by a client or proxy server?

A15: There are several ways to do this, and a combination is usually best to work around bugs and 'features' in various browsers. These boil down to two basic methods - putting tags in the HTML and sending certain HTTP headers back.

The relevant HTTP headers are:

Pragma: no-cache - although most caches do not honour this header.
Expires - setting a date in the 'past' will, theoretically, prevent a document from being cached.
Cache-Control - this is new to HTTP 1.1 but is very flexible and powerful.
Last-Modified - by setting this in the past, some proxies will not cache a document.
ETag - this HTTP 1.1 feature is like a checksum generated by the HTTP server to help caches decide whether to cache or not.

In HTML one can achieve the same effect by using a META tag in the HEAD element, in the form

<meta http-equiv="name" content="content">

where the name and content are the same as the HTTP headers above. In this case the *browser* may not cache the page. However, this is extremely unlikely to affect proxies, since they do not parse the HTML. It also doesn't help non-HTML documents.

A third method is to generate a unique URL by putting a random number in it, either as a CGI parameter:

http://foo.com/nevercache.html?948798739875389753987539875

which should have no effect on the page actually being displayed. However, some caches may (wrongly) ignore any CGI parameters, so using a rewrite engine like Apache's mod_rewrite, this could be rewritten as

http://foo.com/93429487293847234/nevercache.html

(where the rewrite engine will strip out the random-number part). The random number would have to be inserted into every link by way of a Server Side Include, CGI, or another rewrite engine (one that parses the HTML and inserts a new random ID into the appropriate links every time the page is displayed).
It should be noted that there is no guarantee that pages will not be cached, since the browser or proxy could be a very naive implementation or, at the other end of the scale, something 'smart' that attempts to do clever tricks anyway. With this in mind, a programmer relying on no caching would be wise to have mechanisms in place to check that this is indeed the case.

Q16: What do the numbers 200, 302 and 404 mean to you?

A16:
200 - OK
302 - FOUND (i.e. the document has moved to a new location, but this new location is liable to change frequently. Used by CGI scripts that send a Location: header)
404 - NOT FOUND

Q17: A user complains "http://www.yahoo.de is really slow". How would you attempt to debug this?

A17: It would depend on the technical competence of the user, but the aim is to identify the source of the problem. The obvious first thing to check is whether the site itself is slow. The way to check this is to see whether it is slow for you, especially from a remote site (check using a web browser on a remote machine with X tunnelled over ssh, or some other technique), preferably with no caching whatsoever.

If it's not slow for you, then check with the user whether it is slow *all* the time or just sometimes. If it's sometimes, you could check whether that's because they're doing a specific query (which should then be investigated using a profiler by a developer), or possibly because they're being redirected to a specific machine by the load balancer, or because a machine is overloaded.

If it's slow all the time, you need to check whether all other sites are also slow or whether it's just yahoo.de. If it's all of them, then checks on the performance of the user's machine and connection (a 486DX on a 9600 bps modem line, say) would be in order. If it's just yahoo.de, you could see whether certain features such as images, Java or Flash are affecting it.
A final thing would be to step them through a traceroute of a request to see where the bottleneck is - for example, a peering arrangement may have failed, and that may need to be investigated.

Q18: Roughly sketch out a design for an HTTP server. Keep it really simple: just the ability to respond to GET requests and to handle multiple connections simultaneously.

A18: The server either sits as a daemon listening on a particular port or is activated via inetd. If it is a daemon, then as each request comes in a child is forked off to handle it, which gives multiple simultaneous connections.

To handle a GET request, the URL is parsed and everything after an initial question mark (?) is placed in the environment variable QUERY_STRING. Then the document requested is examined; various security techniques, such as making the URL absolute and checking that it does not point outside a particular directory structure, can be used.

Then check whether the document should be executed (either explicitly, through configuration files and handlers, or by checking whether the executable permission bit has been set). If it should, it is executed using the shell and the output collected; if there is no output then the appropriate header is sent back, otherwise the output is returned to the client.

If it isn't to be executed, then the document itself is served. If the server does not have permission to read the file then the appropriate response is sent back (401 or 403); otherwise an HTTP 200 (OK) header is sent and the document is sent after it.

After any of these, the child exits gracefully, cleaning up after itself. Techniques such as logging, authorization and redirection could also be built in.

Q19: Imagine you have all the resources necessary: what would you do in order to hack Yahoo!? By "hack" we mean somehow change http://www.yahoo.de

A19: The short answer is - get a job there.
Or somehow socially engineer my way into the building and leave some device to do the nefarious deed for me (a micro-PC such as a Cappuccino, an iPAQ or a Zaurus would be perfect; or even a Sega Dreamcast, though I wouldn't waste a broadband adaptor on it because they're hard to get hold of). Having local access makes things a lot easier. You can packet-sniff a (non-switched) network, walk up to the machines, ask people to type in passwords on your machines whilst a keyboard sniffer is running, or even be given the root password if you're a trusted techie. Then you just leak that to somebody else. If you're not fussed about getting caught, then you walk up to the servers (if you have physical access to them) and trash them in some way - either by simply turning them off or by more extreme methods such as hitting them with a fire axe.

Remotely it is more difficult. And very tedious. The obvious method would be to run scans on all the externally visible machines on the yahoo.de network and build up a map of the topology, with the OS, open ports and services of each machine profiled. To hide oneself, this could be done through anonymising tools such as IP spoofing, an anonymiser, a cheap dialup, hijacking somebody else's bandwidth (at a cyber cafe, or by finding an open wireless node) or getting a shell account on another box.

Once one had a map, one could then try every available remote exploit (whilst remaining hidden via the techniques above) to gain access to one of the machines, and then try to work one's way to the web servers. Since yahoo.de probably has load-balanced, replicated web servers with backups, this would be hard, since a simultaneous attack on all the servers would be necessary to effect a change.

One idea might be to construct an Outlook virus or Word macro virus that installs a keyboard sniffer on a machine and then mails the logs to an anonymous account every day.
Or one that tries exploits on routers in the office, in the hope that the leased line will go down and the web servers will be unavailable to the outside world. A final thing would be to go through the site meticulously, attempting likely exploits via the CGI scripts (reading the password file, executing an arbitrary script, etc.).

Essentially there are a thousand techniques to try, most of which have probably already been tried, and virtually none of which are likely to work. In the end the results don't warrant the effort expended.

------------------------------------------------
UNIX

Q20: Write a command to recursively search all HTML files below the current directory for the string "ABCDE"

A20:

find ./ -name '*.htm*' -exec grep -q ABCDE {} \; -print
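The command can be checked against a throwaway directory tree (the file names below are made up purely for demonstration): grep -q quietly tests each candidate file, and find's -print only fires for the files where grep succeeded.

```shell
# Build a throwaway tree (hypothetical file names, for demonstration only).
dir=$(mktemp -d)
mkdir -p "$dir/sub"
printf 'xx ABCDE yy\n' > "$dir/sub/a.html"
printf 'no match here\n' > "$dir/b.htm"

# The command from the answer: for every file matching '*.htm*',
# run grep -q (quiet, exit status only); -print runs only when
# the preceding -exec succeeded, i.e. the file contained ABCDE.
found=$(cd "$dir" && find ./ -name '*.htm*' -exec grep -q ABCDE {} \; -print)

echo "$found"   # -> ./sub/a.html
rm -r "$dir"
```

With GNU grep, "grep -rl --include='*.htm*' ABCDE ." does the same job in a single process, though --include is a GNU extension and not portable.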