Assignment

Updates

This section is a record of updates to the assignment spec after the initial post. Updates will also be bolded in the assignment text.

07-11-2019: Update query range in 1(b) to be inclusive exclusive.
10-11-2019: Changed due date to 9pm the Monday after (I’ve been advised it is best not to have assignments due at midnight).
12-11-2019: Added Submission section.
13-11-2019: Added some hints (all in bold)

Submission

Submission will be done on DOMJudge, which can be accessed here. This link also has the canonical problem statements, with bounds and sample data.

You will need to login with your assignment credentials. These will be sent to your UNSW account in an email titled “COMP4128 Assignment Account”. If you have not received this yet the most likely explanation is I did not know your group for the assignment the last time I sent out emails. (Currently the last time I sent out emails is 2019-11-16 10:00:00)

If you have not done so already, please email me your group (you need to do this even if you are doing the assignment solo) in the following format:

Recipient: cs4128@cse.unsw.edu.au
Subject: “Assignment 1 Group”
Body: a single line per group member (with no empty lines) containing just the zID. Include your own zID. E.g: If I (z3333333) am in a group with z4444444 then I would send an email with just the body

z3333333

z4444444

Overview

So far we have not talked much about the difference between online and offline algorithms. This assignment will explore a bit of the difference between the 2, while introducing some useful techniques we did not get a chance to cover in the course. Along the way we will illustrate the techniques with some (hopefully) interesting problems.

This assignment can be attempted individually or in pairs.

Your mark will be automatically calculated based on the problems you solve. There is no written component. While hints and guidance will be given, some of the problems are still intended to be quite difficult and require effort to solve (many of the problems would be suitable for the last problem of a problem set if not for the hints). For this reason, I recommend reading through the problems early so you can think about them in the back of your head.

Weight: 15%

Due at 21:00:00 on Monday the 25th of November (right after the end of week 10).

Late penalty: the maximum available mark for the assignment is reduced by 3 marks for each day late, rounded up. That is, your mark is capped at 12/15 if you are one day late, 3/15 if four days late, and zero if more than 4 days late.

A bit of background

A loose unifying theme for this assignment is the distinction between online and offline problems. The basic context is the following: Suppose you have a problem in which you have to support some updates and queries.

In the online variant of the problem, you would be expected to answer each query before receiving future updates and queries. This mirrors the real world scenario of responding to live requests and updates.
In the offline variant of the problem, you receive all queries and updates in advance. This mirrors processing all your data in a batch.

So far the problems we have seen have all been offline. In programming competitions, online problems are generally implemented as one of the following:

All data is given at once but each query/update is hashed using the result of the previous query. A typical phrase you might see is “Let last be the answer to the last query. For each query, XOR each of the arguments with last before applying the query”.
Your program writes as per usual to stdin and stdout. However, stdin and stdout are linked to a judge program which only writes queries to stdin once it has received a correct output from stdout.
You are given a header file your program must include. The header file contains the functions you must call to receive updates and answer queries. On submission, your program is compiled with an implementation of the header that also judges your outputs.

Whether a problem is online or offline is often a crucial part of a problem. Certain problems are significantly easier (or only solvable) offline. We will see a famous example in this assignment. This is because certain techniques require having all the data in advance. The most fundamental example is sorting the data and performing a linear sweep (as in the Paradigms lecture slides), this is literally impossible if we don’t actually have all the data.

In this assignment we will explore:

One of the fundamental techniques for online problems, persistence. Persistence is also interesting in its own right.
An interesting technique for solving Dynamic Connectivity offline.
A nice problem that relies on essentially persisting the immediate steps of Kruskal’s algorithm.

Part 1: Persistent Range Trees

One very useful and important class of algorithms we didn’t get a chance to cover is persistent data structures. Data structures we’ve seen so far are “ephemeral”, updates mutate the state and we can only query the latest state. Persistent data structures, on the other hand, support “versioning”. They allow you to query (and update, though this appears less often in competitions) past versions of the data structure! For many data structures this can be done with O(1) additional space and slowdown per query.

Our setting is a data structure consisting of nodes and pointers (such as a range tree!). There are 2 basic approaches to persistence, fat nodes and path copying:

Fat nodes store within each node a version history of all modifications.
Path copying creates a copy of all nodes that have been modified for each update. This includes the parents of any modified nodes as the pointers to their children will have changed. This eventually terminates at the root node which is copied each update.
There is also a hybrid approach.

The most useful persistent data structure in programming competitions is persistent range trees (what a surprise). These are generally implemented by path copying.

Part 1 Overview

In this part:

First, you will see a problem where being offline is critical.

You will then be asked to code a persistent range tree. Persistent data structures are useful for many reasons. One of their less obvious applications is they are a powerful tool for reducing online questions to offline questions.

Finally, you will apply persistent range trees to solve the online variant of the original problem.

First, let’s get a sense of why being offline is sometimes so crucial.

Part 1(a): Distinct Queries [2 marks]

Given an array of N integers, answer Q queries of “given a pair (i, j) (1 <= i <= j <= N), how many numbers appear exactly once in the range A[i,..j].

Queries are given offline. Aim for O((N + Q) log N).

Note: This problem is non-trivial, do not be fooled by it being part 1(a). You will probably have to think for a while to solve this. It might be a better idea to do part 1(b) and other parts first.

Hint: Crucially this problem is offline.

Part 1(b): Persistent Segment Trees [2 marks]

This part will essentially just be implementing a persistent range tree.

So far we’ve been implementing range trees within arrays. However, it is easier to path copy if we use structs and pointers instead. The key is, each update to a range tree only updates and traverses O(log n) nodes. Hence on an update, we just need to path copy these O(log n) nodes!

The following is a rough gist of one way of implementing persistence (though it is of course not the only way. Feel free to look online for further explanation):

In your update function recurse as per usual. However, in the base case, instead of updating the node in place, create a copy of the node and update the copy.
Instead of returning void, have your update function return the copy of the node in the base case. This allows the parent to update its children to point to the new copy.
However, in doing so, you need to actually copy the parent node and modify the copy, not the original.
Because of this, all nodes you traverse in your update function will need to be copied.
Each update, the root will be copied. Store each of these copies of the root in an array. For each query, perform the query on the appropriate root in your array.

Using this, we get persistence with O(1) overhead for time complexity, however it does require O(log N) extra space per update.

Actual Problem Statement:

Given an A[N], initially all 0. Support the following operations:

U i j c. Set A[i,..,j) to c. 0 <= i < j <= N.
Q v i j. Consider the array after the vth update. What was the sum of the subarray A[i,..,j)? (07-11-2019: changed this from [i,..,j] to [i,..j)) 0 <= i < j <= N. Let U be the number of updates prior to this query. Then 0 <= v <= U. If v = 0, use the initially all 0 array.

Do this online. Aim for O(Q log N + N) time and space complexity.

Note: If you lazy propagate, you’ll need to be a bit careful your lazy propagation interacts correctly with your persistence. This requires a bit of technical care that you’ll need to work out but isn’t overly challenging.

Note2: Crucially, you should maintain that you never modify a node after it has been created. So whenever you wish to modify a node you need to create a copy. Failure to maintain this quickly leads to debugging nightmares.

Part 1(c): Distinct Queries Online [1 mark]

Repeat part 1(a) except online.

Part 2: Offline Dynamic Connectivity

By now you’ve seen Union Find which allows you to maintain connectivity of a graph, assuming edges are only added but never deleted. You have also seen the opposite in Problem Set 2 D, Ancient Berland Roads. Supporting both in general is doable and well known, but very challenging. Approaches I’m aware of either use advanced data structures we have not covered yet or involve a lot of details.

However, remarkably there is a simpler algorithm if the problem is offline! The general technique is also useful in other situations where deletion is difficult.

Part 2 Overview

In this part:

You will build up an algorithm for general offline dynamic connectivity. In 2(a), you are presented with a limited form of the full problem where undos are the only form of deletions present. This turns out to make the problem significantly more approachable.

Part 2(b) then details a very clever reduction of the full problem to the variant in 2(a). Your task here is to understand the given sketch, fill in any omitted details and implement it.

It is based on the following restricted version of connectivity being solvable:

Part 2(a): Connectivity with Undos [3 marks]

Given a graph with V vertices and no edges. Support the following queries:

Add an undirected edge a b.
Delete the most recently added edge that hasn’t been deleted.
Query, are vertices a and b currently connected?

This part will be offline however I don’t think it matters much. Aim for O(Q log V).

Hint: There must be something special about undos that makes the problem a lot easier than general deletions.

Part 2(b): Offline Dynamic Connectivity [2 marks]

In the general setting, any edge can be deleted, not just the most recent. The general case involves a very clever reduction to part 2(a).

Instead of thinking of edges as deleted, think of each edge as a range update that is valid for a range of queries. We store these updates in a range tree (where else). We perform a DFS over this range tree to answer all our queries. On entering a node, we enact the updates stored in the node. On leaving a node, we undo all the updates stored in the node.

Work out the details and solve the following:

Actual Problem Statement:

Given a graph with V vertices and no edges. Support the following queries:

Add an undirected edge a b.
Delete the edge added in query q. It is guaranteed query q was an add query and the edge added in query q has not already been deleted.
Query, are vertices a and b currently connected?

Queries are given offline. Aim for O(Q log Q log V).

Part 3: Persisting Kruskal’s Algorithm

This section is more niche but I find it a very cute technique. We will build up to the solution for an interesting, hard problem. The title says persisting but it only bears superficial similarity to part 1. While we will persist data that is normally transient, we use a different technique to do so.

Union Find and Kruskal’s Algorithm can be thought of as incrementally building a forest. So far we have mostly only cared about the end result of Kruskal’s Algorithm. However, sometimes we want to answer queries that only allow us to use a subset of the edges. In such cases, it is useful to note that if you do Union Find and Kruskal’s in the right way that the resulting forest contains in some sense the intermediate forests built by Kruskal’s.

Part 3 Overview

In this part:

You have been given a difficult problem (3(c)) that requires a few reductions before becoming solvable. Parts 3(a) and 3(b) aim to show the first (and most important) of these reductions.

Part 3(a) presents a relatively standard offline problem.

Part 3(b) asks for the online variant. Here, instead of answering the queries directly as in 3(a), you will first build a structure that you can query to get the answers.

Finally 3(c) is a difficult but interesting problem. My opinion is initially 3(c) seems impossible as the problem asked appears very ad-hoc. As a first step, think about how your solution to 3(b) presents a framework to approach 3(c) with.

One takeaway I hope to impart from 3(c) is how one might attack a problem in steps, each time making the problem a bit less adhoc and a bit more manageable.

First, let’s try an offline variant of a subproblem we might be interested in.

Part 3(a): Offline Connected Components [1 mark]

You’ve been given a schematic for the upcoming construction of GraphTown. GraphTown has V vertices and E edges. Initially, at time 0 none of the edges have been built. Each edge is undirected, connecting two cities ai bi and has a time ti at which it will be built.

Answer Q queries of, at time t how many cities could city a reach?

Queries are given offline. Aim for about O((E + Q) log V).

Part 3(b): Online Connected Components [1 mark]

Do the above online. That is the queries will be given online (you will be given the schematic of the town in advance though)

Hint: The trick will be not to persist the intermediate steps as in Part 1 but to recover the intermediate forests from the final result of Kruskal’s. Think about how to do this.

Part 3(c): Werewolf [3 marks]

Solve this problem.

Note: The above should give you some framework to approach this problem with. The problem however is still non-trivial and you will likely have to think for a while to fill in the intermediate steps. But if you have been following along with the lectures, it should be doable!

Judging environment

Problems to submit to will shortly be added to DOMJudge. I will send out a class email when they are available. Same as problem sets and contests, this assignment will be solely automarked.

Acadamic Honesty and Plagiarism

The course policy can be found here.

You should not collaborate with any student other than your partner for the assignment. In particular, sharing any code is strictly prohibited.

I am aware most of these problems have solutions readily available.

Do NOT copy code from the internet.
Do NOT look up editorials online for the problems.

I strongly encourage you to put an earnest attempt into solving these problems yourself. If you need help, your tutors can help provide hints.