Week 10 Laboratory Exercises

Objectives

  • Writing and using your own Python modules
  • Exploring modules in Python

Preparation

Before the lab you should re-read the relevant lecture slides and their accompanying examples.

Getting Started

Set up for the lab by creating a new directory called lab10 and changing to this directory.
mkdir lab10
cd lab10

There are some provided files for this lab which you can fetch with this command:

2041 fetch lab10

If you're not working at CSE, you can download the provided files as a zip file or a tar file.

Exercise:
DNA analysis in Python

Your task is to add code to the file dna.py to do DNA analysis.

Don't worry, you don't need to know anything about DNA, RNA or base pairs.

You have been given the file test_dna.py, which imports dna.py and uses its functions to analyse a file. Do not change test_dna.py. Only change dna.py.

You have been given 6 test files, data1 .. data6, containing base pairs (again, don't worry, you don't need to know what a base pair is).

The format of the base pair files is simple:

sed -n 1,3p data1
G <-> C
T <-> A
A <-> T
But note that one or both elements of a base pair may be missing.
grep -E '^ <->|<-> $' data3|head
 <-> A
 <-> T
 <-> A
 <-> G
 <-> A
 <-> G
 <->
 <->
G <->
A <->
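The line format above can be split into a pair in several ways. Here is a minimal sketch of one of them, assuming every line contains the separator <->. The helper name parse_pair is hypothetical and is not part of the provided dna.py.

```python
def parse_pair(line):
    """Split one 'X <-> Y' line into a (first, second) tuple of strings."""
    # partition splits on the first occurrence of the separator only
    first, _, second = line.partition("<->")
    # strip removes surrounding spaces and the trailing newline,
    # so a missing base becomes the empty string
    return first.strip(), second.strip()


print(parse_pair("G <-> C\n"))
print(parse_pair(" <-> A\n"))
print(parse_pair("G <->\n"))
print(parse_pair(" <->\n"))
```

A missing base comes out as '', which matches the tuples shown in the read_dna docstring.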
Here is how test_dna.py will work when you've completed the functions in dna.py (test_dna.py imports dna.py):
./test_dna.py data1
the file data1 is DNA
there are 100 pairs in the file
first 10 pairs:
G <-> C
T <-> A
A <-> T
G <-> C
A <-> T
G <-> C
T <-> A
C <-> G
T <-> A
A <-> T
last 10 pairs:
C <-> G
T <-> A
G <-> C
T <-> A
G <-> C
G <-> C
T <-> A
C <-> G
A <-> T
T <-> A
the most common base is Guanine
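The last line of this output combines two steps: finding the most common first base and turning its letter into a name. The sketch below shows one possible approach using collections.Counter and a dictionary lookup. The sample list and the name BASE_NAMES are made up for illustration, and this is not necessarily how test_dna.py expects the functions in dna.py to be written.

```python
from collections import Counter

# Hypothetical lookup table, built from the names listed in the docstrings
BASE_NAMES = {
    "A": "Adenine",
    "T": "Thymine",
    "G": "Guanine",
    "C": "Cytosine",
    "U": "Uracil",
}

# Made-up sample data in the list-of-tuples format
pairs = [("A", "T"), ("G", "C"), ("G", "C"), ("C", "G"),
         ("G", "C"), ("T", "A"), ("", "A")]

# Count only the first element of each pair, skipping empty bases
first_bases = Counter(first for first, _ in pairs if first)
base, count = first_bases.most_common(1)[0]

# dict.get supplies a fallback for letters that are not in the table
print(f"the most common base is {BASE_NAMES.get(base, 'Unknown')}")
```

If two bases are equally common, this sketch reports whichever Counter lists first. The exercise text does not say how ties should be handled.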

The docstrings of the functions in dna.py give you more information about how to complete each function.

def read_dna(dna_file):
    """
    Read a DNA string from a file.
    the file contains data in the following format:
    A <-> T
    G <-> C
    G <-> C
    C <-> G
    G <-> C
    T <-> A
    Output a list of tuples:
    [
        ('A', 'T'),
        ('G', 'C'),
        ('G', 'C'),
        ('C', 'G'),
        ('G', 'C'),
        ('T', 'A'),
    ]
    Where either (or both) elements in the string might be missing:
    <-> T
    G <->
    G <-> C
    <->
    <-> C
    T <-> A
    Output:
    [
        ('', 'T'),
        ('G', ''),
        ('G', 'C'),
        ('', ''),
        ('', 'C'),
        ('T', 'A'),
    ]
    """
    pass

def is_rna(dna):
    """
    Given DNA in the aforementioned format,
    return the string "DNA" if the data is DNA,
    return the string "RNA" if the data is RNA,
    return the string "Invalid" if the data is neither DNA nor RNA.
    DNA consists of the following bases:
    Adenine  ('A'),
    Thymine  ('T'),
    Guanine  ('G'),
    Cytosine ('C'),
    RNA consists of the following bases:
    Adenine  ('A'),
    Uracil   ('U'),
    Guanine  ('G'),
    Cytosine ('C'),
    The data is DNA if at least 90% of the bases are one of the DNA bases.
    The data is RNA if at least 90% of the bases are one of the RNA bases.
    The data is invalid if more than 10% of the bases are not one of the DNA or RNA bases.
    Empty bases should be ignored.
    """
    pass


def clean_dna(dna):
    """
    Given DNA in the aforementioned format,
    If a pair is incomplete, e.g. ('A', '') or ('', 'G'),
    fill in the missing base with the matching base.
    In DNA 'A' matches with 'T', 'G' matches with 'C'
    In RNA 'A' matches with 'U', 'G' matches with 'C'
    If a pair contains an invalid base the pair should be removed.
    Pairs of empty bases should be ignored.
    """
    pass

def mast_common_base(dna):
    """
    Given DNA in the aforementioned format,
    return the most common first base:
    e.g. given:
    A <-> T
    G <-> C
    G <-> C
    C <-> G
    G <-> C
    T <-> A
    The most common first base is 'G'.
    Empty bases should be ignored.
    """
    pass

def base_to_name(base):
    """
    Given a base, return the name of the base.
    The base names are:
    Adenine  ('A'),
    Thymine  ('T'),
    Guanine  ('G'),
    Cytosine ('C'),
    Uracil   ('U'),
    return the string "Unknown" if the base isn't one of the above.
    """
    pass
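The 90% rule in the is_rna docstring comes down to computing what fraction of the non-empty bases belong to a given set. The sketch below shows that calculation only. It assumes both elements of every pair count as bases, which the docstring does not state explicitly. The names base_fraction, DNA_BASES and RNA_BASES are hypothetical.

```python
DNA_BASES = set("ATGC")
RNA_BASES = set("AUGC")


def base_fraction(pairs, allowed):
    """Return the fraction of non-empty bases in pairs that are in allowed."""
    # Flatten the pairs and drop empty bases, which are to be ignored
    bases = [base for pair in pairs for base in pair if base]
    if not bases:
        return 0.0
    return sum(base in allowed for base in bases) / len(bases)


# Made-up sample: 5 non-empty bases, 4 of them DNA bases, 3 of them RNA bases
sample = [("A", "T"), ("G", ""), ("", ""), ("X", "C")]
print(base_fraction(sample, DNA_BASES))
print(base_fraction(sample, RNA_BASES))
```

Comparing these fractions against 0.9 gives the DNA and RNA tests. What to return when both fractions reach 90% is not covered by the docstring, so check the behaviour autotest expects.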

Download dna.py, or copy it to your CSE account using the following command:

cp -n /import/ravel/A/cs2041/public_html/24T1/activities/DNA/files.cp/dna.py dna.py

When you think your program is working, you can use autotest to run some simple automated tests:

2041 autotest DNA 

When you are finished working on this exercise, you must submit your work by running give:

give cs2041 lab10_DNA dna.py

before Monday 22 April 12:00 (midday) (2024-04-22 12:00:00) to obtain the marks for this lab exercise.

Challenge Exercise:
Bashful Python

We have some Shell (Bash) scripts that do arithmetic calculations that we need to translate to Python.

Write a Python program bashpy.py which takes such a Bash script on stdin and outputs an equivalent Python program.

The scripts use the arithmetic syntax supported by Bash (and several other shells). Fortunately, the scripts only use a very limited set of shell features.

You can assume all the features you need to translate are present in the following 4 examples.

  • sum.sh sums a series of integers:
    cat sum.sh
    #!/bin/bash
    
    # sum the integers $start .. $finish
    
    start=1
    finish=100
    sum=0
    
    i=1
    while ((i <= finish))
    do
        sum=$((sum + i))
        i=$((i + 1))
    done
    
    echo $sum
    ./sum.sh
    5050
    ./bashpy.py < sum.sh
    #!/usr/bin/env python3
    
    # sum the integers $start .. $finish
    
    start = 1
    finish = 100
    sum = 0
    
    i = 1
    while i <= finish:
        sum = sum + i
        i = i + 1
    
    print(sum)
    ./bashpy.py < sum.sh | python3
    5050
    
  • double.sh prints powers of two:
    cat double.sh
    #!/bin/bash
    
    # calculate powers of 2 by repeated addition
    
    i=1
    j=1
    while ((i < 31))
    do
        j=$((j + j))
        i=$((i + 1))
        echo $i $j
    done
    ./double.sh
    2 2
    3 4
    4 8
    5 16
    6 32
    7 64
    8 128
    9 256
    10 512
    11 1024
    12 2048
    13 4096
    14 8192
    15 16384
    16 32768
    17 65536
    18 131072
    19 262144
    20 524288
    21 1048576
    22 2097152
    23 4194304
    24 8388608
    25 16777216
    26 33554432
    27 67108864
    28 134217728
    29 268435456
    30 536870912
    31 1073741824
    ./bashpy.py < double.sh
    #!/usr/bin/env python3
    
    # calculate powers of 2 by repeated addition
    
    i = 1
    j = 1
    while i < 31:
        j = j + j
        i = i + 1
        print(i, j)
    ./bashpy.py < double.sh > double.py
    chmod +x double.py
    ./double.py
    2 2
    3 4
    4 8
    5 16
    6 32
    7 64
    8 128
    9 256
    10 512
    11 1024
    12 2048
    13 4096
    14 8192
    15 16384
    16 32768
    17 65536
    18 131072
    19 262144
    20 524288
    21 1048576
    22 2097152
    23 4194304
    24 8388608
    25 16777216
    26 33554432
    27 67108864
    28 134217728
    29 268435456
    30 536870912
    31 1073741824
    
  • pythagorean_triple.sh searches for Pythagorean triples:
    cat pythagorean_triple.sh
    #!/bin/bash
    
    max=42
    a=1
    while ((a < max))
    do
        b=$a
        while ((b < max))
        do
            c=$b
            while ((c < max))
            do
                if ((a * a + b * b == c * c))
                then
                    echo $a $b $c
                fi
                c=$((c + 1))
            done
            b=$((b + 1))
        done
        a=$((a + 1))
    done
    ./bashpy.py < pythagorean_triple.sh
    #!/usr/bin/env python3
    
    max = 42
    a = 1
    while a < max:
        b = a
        while b < max:
            c = b
            while c < max:
                if a * a + b * b == c * c:
                    print(a, b, c)
                c = c + 1
            b = b + 1
        a = a + 1
    ./bashpy.py < pythagorean_triple.sh | python3
    3 4 5
    5 12 13
    6 8 10
    7 24 25
    8 15 17
    9 12 15
    9 40 41
    10 24 26
    12 16 20
    12 35 37
    15 20 25
    15 36 39
    16 30 34
    18 24 30
    20 21 29
    21 28 35
    24 32 40
    
  • collatz.sh prints an interesting series:
    cat collatz.sh
    #!/bin/bash
    
    # https://en.wikipedia.org/wiki/Collatz_conjecture
    # https://xkcd.com/710/
    
    n=65535
    while ((n != 1))
    do
        if ((n % 2 == 0))
        then
            n=$((n / 2))
        else
            n=$((3 * n + 1))
        fi
        echo $n
    done
    ./bashpy.py <collatz.sh
    #!/usr/bin/env python3
    
    # https://en.wikipedia.org/wiki/Collatz_conjecture
    # https://xkcd.com/710/
    
    n = 65535
    while n != 1:
        if n % 2 == 0:
            n = n // 2
        else:
            n = 3 * n + 1
        print(n)
    ./bashpy.py <collatz.sh | python3
    196606
    98303
    294910
    147455
    442366
    221183
    663550
    331775
    995326
    497663
    1492990
    746495
    2239486
    1119743
    3359230
    1679615
    5038846
    2519423
    7558270
    3779135
    11337406
    5668703
    17006110
    8503055
    25509166
    12754583
    38263750
    19131875
    57395626
    28697813
    86093440
    43046720
    21523360
    10761680
    5380840
    2690420
    1345210
    672605
    2017816
    1008908
    504454
    252227
    756682
    378341
    1135024
    567512
    283756
    141878
    70939
    212818
    106409
    319228
    159614
    79807
    239422
    119711
    359134
    179567
    538702
    269351
    808054
    404027
    1212082
    606041
    1818124
    909062
    454531
    1363594
    681797
    2045392
    1022696
    511348
    255674
    127837
    383512
    191756
    95878
    47939
    143818
    71909
    215728
    107864
    53932
    26966
    13483
    40450
    20225
    60676
    30338
    15169
    45508
    22754
    11377
    34132
    17066
    8533
    25600
    12800
    6400
    3200
    1600
    800
    400
    200
    100
    50
    25
    76
    38
    19
    58
    29
    88
    44
    22
    11
    34
    17
    52
    26
    13
    40
    20
    10
    5
    16
    8
    4
    2
    1
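One way to start on a translator like this is to handle one line at a time with regular expressions. The sketch below covers only three of the line types in the examples above, and it is a starting point rather than a complete solution. It assumes every echo argument is a variable and that the Bash indentation can be reused unchanged, which holds for the four example scripts but may not hold in general. The function name translate_line is hypothetical.

```python
import re


def translate_line(line):
    """Translate one Bash line of a few simple kinds into Python."""
    body = line.strip()
    # Keep the leading whitespace so nesting carries over to the Python code
    indent = line[: len(line) - len(line.lstrip())]

    # name=$((expression))  ->  name = expression
    match = re.fullmatch(r"(\w+)=\$\(\((.*)\)\)", body)
    if match:
        # Bash arithmetic division is integer division, written // in Python
        expression = match.group(2).strip().replace("/", "//")
        return f"{indent}{match.group(1)} = {expression}"

    # while ((condition))  ->  while condition:
    match = re.fullmatch(r"while \(\((.*)\)\)", body)
    if match:
        return f"{indent}while {match.group(1).strip()}:"

    # echo $a $b  ->  print(a, b)
    match = re.fullmatch(r"echo (.*)", body)
    if match:
        names = ", ".join(word.lstrip("$") for word in match.group(1).split())
        return f"{indent}print({names})"

    # Anything else is passed through without its trailing newline
    return line.rstrip("\n")


for bash_line in ["    n=$((n / 2))", "while ((i <= finish))", "    echo $i $j"]:
    print(translate_line(bash_line))
```

Lines such as do, done, then, fi, else, if, plain assignments like start=1 and b=$a, and the #! line are not handled here and still need their own rules.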
    

When you think your program is working, you can use autotest to run some simple automated tests:

2041 autotest bashpy 

When you are finished working on this exercise, you must submit your work by running give:

give cs2041 lab10_bashpy bashpy.py

before Monday 22 April 12:00 (midday) (2024-04-22 12:00:00) to obtain the marks for this lab exercise.

Challenge Exercise:
When Regular Expressions Aren't Regular

Write a regular expression that validates a JSON file.

In other words, write a regex that matches a string iff that string is valid JSON.

Here is a test program to assist you in doing this:

#! /usr/bin/env python3

from sys import argv, stderr
import regex

regex.DEFAULT_VERSION = regex.V1

assert len(argv) == 3, f"Usage: {argv[0]} <json file> <regex file>"

json_file, regex_file = argv[1], argv[2]

try:
    with open(json_file) as json_data, open(regex_file) as regex_data:
        if regex.search(regex_data.read(), json_data.read(), timeout=5):
            # In the test suite, all files that start with "y_" should be valid.
            print(f"Valid   JSON file: {json_file}")
        else:
            # In the test suite, all files that start with "n_" should be invalid.
            print(f"Invalid JSON file: {json_file}")

except TimeoutError as e:
    # Allow a timeout error to signal that the JSON file is not valid
    print(f"Invalid JSON file: {json_file}")
    # This is printed to stderr so that it is not captured by the test
    print(f"5 second time limit reached while reading {json_file}", file=stderr)

Download test_regex_json.py, or copy it to your CSE account using the following command:

cp -n /import/ravel/A/cs2041/public_html/24T1/activities/regex_json/test_regex_json.py test_regex_json.py

You have been given a directory JSONTestSuite containing a number of JSON files.

There are two types of files in this directory:
Files starting with y_ are valid JSON files.
Files starting with n_ are invalid JSON files.

Put your solution in regex_json.txt.

For example, to test the regex ^.+$:

chmod 755 test_regex_json.py
unzip JSONTestSuite.zip
cat regex_json.txt
^.+$
./test_regex_json.py JSONTestSuite/y_array_heterogeneous.json regex_json.txt
Valid   JSON file: JSONTestSuite/y_array_heterogeneous.json
./test_regex_json.py JSONTestSuite/n_array_star_inside.json regex_json.txt
Valid   JSON file: JSONTestSuite/n_array_star_inside.json
This should be Invalid, so the regex ^.+$ is incorrect.

If your solution is correct, all files in the JSONTestSuite starting with y_ should be labelled valid, and all files starting with n_ should be labelled invalid.
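While developing your regex, it may help to compare its verdicts against an existing JSON parser. The sketch below uses Python's standard json module as a rough reference. It is only a cross-checking aid, not a solution to the exercise, and the parser may disagree with the test suite on some edge cases. The function name is_valid_json is hypothetical.

```python
import json


def is_valid_json(text):
    """Return True if Python's json module accepts text as JSON."""

    def reject_constant(name):
        # json.loads accepts NaN and Infinity by default; these are not JSON
        raise ValueError(f"non-standard constant: {name}")

    try:
        json.loads(text, parse_constant=reject_constant)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_valid_json('[1, "two", null]'))
print(is_valid_json("[*]"))
print(is_valid_json("NaN"))
```

Very deeply nested input can raise RecursionError instead of ValueError, which this sketch does not catch.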

When you think your program is working, you can use autotest to run some simple automated tests:

2041 autotest regex_json 

When you are finished working on this exercise, you must submit your work by running give:

give cs2041 lab10_regex_json regex_json.txt

before Monday 22 April 12:00 (midday) (2024-04-22 12:00:00) to obtain the marks for this lab exercise.

Submission

When you have finished each exercise, make sure you submit your work by running give.

You can run give multiple times. Only your last submission will be marked.

Don't submit any exercises you haven't attempted.

If you are working at home, you may find it more convenient to upload your work via give's web interface.

Remember you have until Week 11 Monday 12:00:00 (midday) to submit your work.

You cannot obtain marks by e-mailing your code to tutors or lecturers.

You can check the files you have submitted here.

Automarking will be run by the lecturer several days after the submission deadline, using test cases different to those autotest runs for you. (Hint: do your own testing as well as running autotest.)

After automarking is run by the lecturer you can view your results here. The resulting mark will also be available via give's web interface.

Lab Marks

When all components of a lab are automarked you should be able to view the marks via give's web interface or by running this command on a CSE machine:

2041 classrun -sturec