
Valentina.Crisan@academyofdata.com

Throughout the course we will work with 2 sets of data:

Movielens data set (real data): 6k users, 4k movies, 1 million ratings - we copy only 10k entries
Weather data (randomly generated): 500k entries

Movielens data details:

Movies.csv extract

mid  title                                genres                          year
1    Toy Story (1995)                     {Animation,Children's,Comedy}   1995
2    Jumanji (1995)                       {Adventure,Children's,Fantasy}  1995
3    Grumpier Old Men (1995)              {Comedy,Romance}                1995
4    Waiting to Exhale (1995)             {Comedy,Drama}                  1995
5    Father of the Bride Part II (1995)   {Comedy}                        1995

Users.csv extract

uid  gender  age  occupation  zip
1    F       1    10          48067
2    M       56   16          70072
3    M       25   15          55117
4    M       45   7           0246

Ratings2.csv extract

ruid  rmid  rating  timestamp
1     1193  5       978300760
1     661   3       978302109
1     914   3       978301968
1     3408  4       978300275
1     2355  5       978824291

Denormalized data - Ratings.csv data

uid   age  gender  occupation  zip    rating  datetime             mid  title                   year  genres
5234  25   F       0           60657  1       2000-06-19 20:40:40  31   Dangerous Minds (1995)  1995  {Drama}
1835  25   M       19          11501  3       2000-11-22 08:11:31  31   Dangerous Minds (1995)  1995  {Drama}
6035  25   F       1           78734  2       2000-04-26 01:25:35  31   Dangerous Minds (1995)  1995  {Drama}
4637  35   F       9           48302  3       2000-07-19 18:48:41  31   Dangerous Minds (1995)  1995  {Drama}
3841  45   M       18          26101  3       2000-08-11 13:14:31  31   Dangerous Minds (1995)  1995  {Drama}
2242  18   M       4           60115  5       2000-11-19 02:40:43  31   Dangerous Minds (1995)  1995  {Drama}

weather-data.csv

 w_id       | event_time                      | temperature
------------+---------------------------------+-------------
 PNOSCPQCNX | 2015-01-01 00:02:54.000000+0000 |          20
 PNOSCPQCNX | 2015-01-01 00:10:00.000000+0000 |          23
 PNOSCPQCNX | 2015-01-01 00:29:53.000000+0000 |          20
 PNOSCPQCNX | 2015-01-01 01:04:22.000000+0000 |          12
 PNOSCPQCNX | 2015-01-01 01:17:15.000000+0000 |          18
 PNOSCPQCNX | 2015-01-01 02:52:25.000000+0000 |          23
 PNOSCPQCNX | 2015-01-01 03:09:17.000000+0000 |          11
 PNOSCPQCNX | 2015-01-01 03:14:57.000000+0000 |          13
 PNOSCPQCNX | 2015-01-01 04:13:33.000000+0000 |          14
 PNOSCPQCNX | 2015-01-01 04:47:56.000000+0000 |          13

All the above files can be found in the data directory
(both cloud and local installations).

cd /data
ls -l

Note: for the intro part we play a bit with different data types and insert data manually, before we work with the above data.

CQL exercises
Reference for syntax questions:
http://docs.datastax.com/en/cql/3.3/cql/cql_using/useAboutCQL.html

Run cqlsh

1. Create a Keyspace - for local installations use metro_systems, but for cloud installations use your name (e.g. my keyspace is valentina_c) -- since all cloud users need to create a separate working space. You will use that name instead of metro_systems in the commands below.

CREATE KEYSPACE metro_systems WITH REPLICATION = { 'class' : 'SimpleStrategy' , 'replication_factor' : 1 };

Desc keyspaces   // will list all keyspaces on the cluster

describe metro_systems; // you can see the Replication factor & strategy set at keyspace level

Exit cqlsh and run the nodetool status command - this should give you an image of your cluster (even if it's a 1 node cluster). Note the datacenter name.

Re-enter cqlsh

Let's create another keyspace with the NetworkTopologyStrategy replication strategy and Replication factor 1.

CREATE KEYSPACE metro_systems1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 };

describe metro_systems1;

Let's alter a created keyspace and change the durable_writes property:

ALTER KEYSPACE metro_systems1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy' , 'datacenter1' : 1} and DURABLE_WRITES = false;

OR if we want to modify the replication factor from 1 to 3:

ALTER KEYSPACE metro_systems1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy' , 'datacenter1' : 3};
ALTER KEYSPACE valentina_c_1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy' , 'datacenter1' : 3};

Desc metro_systems1

Let's drop keyspace metro_systems1:

DROP KEYSPACE metro_systems1;

Stop: let's discuss what the durable writes option in the keyspace description means.

2. Let's create and modify a column family

First, connect to the created keyspace: i.e. all further CQL commands will be executed in the context of the chosen keyspace.

USE metro_systems;

2.1. Create table users with the next structure: userid (uuid), name (text), gender (text), age (int), occupation (text), zipcode (int) and a compound primary key: partition key userid and clustering key name.

CREATE TABLE users (
    userid uuid,
    name text,
    gender text,
    age int,
    occupation text,
    zipcode int,
    PRIMARY KEY ((userid), name));

DESCRIBE users; // you can see all properties of the table

2.2. ALTER: Add/Delete/Rename columns, Alter column type

ALTER TABLE users ADD relatives int;   // add a new column to the table

SELECT * FROM users;

ALTER TABLE users RENAME userid TO uid; // rename column

ALTER TABLE users ALTER relatives type blob;  // alter column type from int to blob

Describe users; // to see the types of the columns

Please note: when altering column types, some changes will not be possible, for example varchar to ascii.

ALTER TABLE users DROP relatives;   // delete the newly added column from the table

SELECT * FROM users;

Describe users;

DROP TABLE users; // deletes all the data of the table including the table structure & indexes

Desc tables;  // no tables

2.3. Insert values in the created table

Assuming you dropped the table users previously, let's re-create it.

CREATE TABLE users (
    uid uuid,
    name text,
    gender text,
    age int,
    occupation text,
    zipcode int,
    PRIMARY KEY ((uid), name));

We have a table, now let's insert data manually. We will also learn later how to insert data from a file, but for the moment let's do some manual inserts/updates.

Observation: the UUID can be generated automatically with uuid();

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Andrew', 25, 'M', 'plummer', 11501);

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(uuid(), 'Dana', 24, 'F', 'data scientist', 11501);

SELECT * FROM users;

TRACING ON; // lets you see the background events Cassandra performs.

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Lili', 24, 'F', 'actress', 11503);
// note that I am inserting data in the same partition as the first insert. Given that the clustering key is different, the data will be added to the existing partition.

TRACING OFF;

SELECT * FROM users; // remember the way Cassandra stores data is different than the CQL view.

Remember for the next section: Cassandra doesn't read (by default) before performing writes.
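The no-read-before-write behavior is easier to internalize with a toy model (plain Python, not Cassandra code): every write is an upsert keyed by (partition key, clustering key), and the latest value per column is simply recorded without checking whether the row already exists. The shortened uids below are made up for illustration.

```python
# Toy model of Cassandra upsert semantics: a write never reads first;
# it just records the latest value per (partition key, clustering key, column).
table = {}

def upsert(uid, name, **columns):
    # INSERT and UPDATE behave identically: merge the columns into the row.
    row = table.setdefault((uid, name), {})
    row.update(columns)

upsert("9a66", "Andrew", age=25)   # first write creates the row
upsert("9a66", "Lili", age=24)     # same partition, new clustering key -> new row
upsert("9a77", "Andrew", age=35)   # unknown partition key -> "update" still writes a row
upsert("9a66", "Andrew", age=35)   # existing row -> column overwritten, no read needed

print(table[("9a66", "Andrew")]["age"])  # 35
print(len(table))                        # 3
```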

2.4. Let's also do some upserts

UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66 and
name = 'Andrew'; // we update the age of an existing user

Select * from users; // to see the change

UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a77 and
name = 'Andrew'; // the UUID - partition key - does not exist, thus the update will do an insert

SELECT * from users; // let's check that everything went well

UPDATE users SET age = 44 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66 and
name = 'Elaine'; // the partition key exists but not with this clustering key, so a new entry inside the existing partition will be generated

Select * from users;

Select * from users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66; // to see all entries in this partition

Let's do some more inserts. The first row inserted will have a new partition and clustering key. The second row will have the same partition key but a different clustering key.

insert into users (uid, name, age, gender, occupation, zipcode) values
(uuid(),'John',35,'M','data analyst',11502); // see the uuid generated here and use it for the next insert

insert into users (uid, name, age, gender, occupation, zipcode) values
(90278d0f-6053-4c87-a1f3-69a05853d2a8,'Elaine',34,'F','engineer',11502);

Select * from users;

Insert into users (uid, name, age) values
(90278d0f-6053-4c87-a1f3-69a05853d2a8, 'Elaine', 43); // see how insert can be used as an update as well

Before we move on, type exit in cqlsh and let's see some stats for our table.

Nodetool cfstats keyspace.table // generic command, see below for the metro_systems keyspace

Nodetool cfstats metro_systems.users  // for cloud users use yourname.users

See that data is still in the memtable, no SSTables generated; see also the estimated number of partitions (Number of keys (estimate)).
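The memtable/SSTable split the stats show can be sketched as a toy model (an illustration of the idea, not the real write path): writes land in an in-memory structure, and a flush freezes it into an immutable, sorted on-disk file.

```python
# Toy sketch of the write path: writes go to an in-memory memtable; a flush
# freezes it into an immutable, sorted "SSTable" (here: a sorted list of tuples).
memtable = {}
sstables = []   # each flush appends one immutable SSTable

def write(key, value):
    memtable[key] = value   # in memory only, nothing on "disk" yet

def flush():
    global memtable
    if memtable:
        sstables.append(sorted(memtable.items()))  # immutable, sorted by key
        memtable = {}                              # start a fresh memtable

write("uid:2", {"age": 24})
write("uid:1", {"age": 25})
assert sstables == []      # like cfstats before flush: data still in the memtable
flush()
assert len(sstables) == 1 and memtable == {}
```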

2.5. Lightweight transactions (LWT)

Re-enter cqlsh and your respective keyspace

Use metro_systems;

Let's try an update on another non-existing entry, but with LWT:

UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a88
and name = 'Andrew' IF EXISTS;

Now let's try an update on an existing entry with LWT:

UPDATE users SET age = 44 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a77
and name = 'Andrew' IF EXISTS;
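The difference from a plain upsert can be sketched in a few lines of Python (a toy model of the IF EXISTS semantics only; real Cassandra runs a Paxos round for this). The shortened uids are made up for illustration.

```python
# Toy sketch of a lightweight transaction: unlike a blind upsert, the write is
# applied only if the condition holds, and the outcome is reported back
# (similar to the [applied] column cqlsh prints).
table = {("9a77", "Andrew"): {"age": 35}}

def update_if_exists(key, **columns):
    row = table.get(key)          # LWT pays for a read-before-write
    if row is None:
        return False              # condition failed: nothing is written
    row.update(columns)
    return True

assert update_if_exists(("9a88", "Andrew"), age=35) is False  # row absent: rejected
assert update_if_exists(("9a77", "Andrew"), age=44) is True   # row present: applied
```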

Exit CQL

Let's force a flush to write the data on disk in an SSTable. Run

Nodetool flush metro_systems users // For cloud users please make sure you use your own keyspace

Go to cd /var/lib/data/cassandra/ . Choose your keyspace directory and the directory of the table. Use ls -l to list the tables generated after flush.

If you would like to see what an SSTable looks like you can use: sstabledump nameofthedatafile // e.g. sstabledump mc-1-big-Data.db

sstabledump -e nameofthedatafile // will show us the partitions in the specific SSTable

STOP here for theory update. Queries section.

3. Let's do some queries on the data - we will come back to Queries, this is just to get us going with the syntax

cd /

Re-enter CQL with the cqlsh command and use your keyspace;

SELECT * FROM users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66;

// You can also pick the columns to display instead of choosing all data.

SELECT age, gender FROM users where uid =
5132b130-ae79-11e4-ab27-0800200c9a66;

SELECT * FROM users LIMIT 3; // you can limit the number of returned values

// You can fine-tune the display order using the ORDER BY clause, either descending or ascending. The partition key must be defined in the WHERE clause and the ORDER BY clause defines the clustering column to use for ordering. Order By works only on clustering keys.

SELECT * FROM users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66 ORDER BY name DESC;
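Why ORDER BY is restricted to clustering keys can be sketched as a toy model: within a partition, rows are physically kept sorted by the clustering key, so ordering is just a forward or reverse scan - no re-sorting happens at query time.

```python
# Toy sketch: rows inside one partition are stored sorted by clustering key
# (name), so ORDER BY name ASC/DESC is just a forward/reverse scan.
partition = {"Andrew": 35, "Elaine": 44, "Lili": 24}   # clustering key -> age

asc = sorted(partition)                  # the default clustering order
desc = sorted(partition, reverse=True)   # ORDER BY name DESC

assert asc == ["Andrew", "Elaine", "Lili"]
assert desc == ["Lili", "Elaine", "Andrew"]
```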

// You can query for a partition key that is part of a list of partition keys

SELECT * FROM users where uid IN (5132b130-ae79-11e4-ab27-0800200c9a66,
edb032e8-5c4e-403f-8dfc-43c2dcd69dba);

You can return the number of records that fulfill a request:

SELECT COUNT(*) FROM users where uid =
5132b130-ae79-11e4-ab27-0800200c9a66;

You can ask for the distinct partitions in a table (DISTINCT works only on partition keys):

select distinct uid from users;

And also count how many distinct partitions exist:

select distinct uid, count(*) from users;

3.1. CREATE AN INDEX

SELECT * FROM users WHERE gender = 'M'; // WILL NOT WORK

// let's create an index

CREATE INDEX users_by_gender ON users(gender);

SELECT * FROM users WHERE gender = 'F';

// let's count how many records are in the query outcome

SELECT count(*) FROM users WHERE gender = 'F';

Tracing on;

select * from users where gender = 'F';  // to see all the operations C* is doing in the background (even for a 1 node cluster you can see loads of operations).

CAN YOU SEE THAT THERE ARE 2 TABLES queried ON THE NODE IN ORDER TO GET THE DATA: the index table and the users table
 Executing single-partition query on users.users_by_gender
 Executing single-partition query on users -- appears 3 times. Can you tell me why?
We will come back to this point in the section Indexing in C*.
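The extra reads in the trace make sense with a toy model of a local secondary index (an illustration, not Cassandra's actual storage layout): a hidden lookup table maps each indexed value back to row keys, so a query costs one index read plus one base-table read per matching row.

```python
# Toy model of a secondary index: a hidden table from indexed value -> row keys.
# Answering the query means one lookup on the index, then one base-table
# lookup per matching key - the extra reads visible in the trace.
users = {("u1", "Andrew"): "M", ("u2", "Dana"): "F", ("u3", "Lili"): "F"}

index = {}
for key, gender in users.items():
    index.setdefault(gender, []).append(key)   # maintained on every write

# select * from users where gender = 'F';
matches = [(key, users[key]) for key in index.get("F", [])]
assert len(matches) == 2
```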

4. Collections

SETS

ALTER TABLE users ADD emails set<text>;

Let's insert some data in the set:

insert into users (uid,name,age,emails,gender,occupation,zipcode)
values (uuid(),'Lilian',18,{'lilian@gmail.com','lilian_briggs@oxford.com'},'F','student',11503);

Select * from users;

Let's add some data to existing entries:

UPDATE users SET emails = emails + {'elaine@yahoo.com'} WHERE uid =
89d3a708-0571-440a-8a9b-997856c17789 AND name = 'Elaine';

Let's delete some data from existing entries:

UPDATE users SET emails = emails - {'elaine@yahoo.com'} WHERE uid =
89d3a708-0571-440a-8a9b-997856c17789 AND name = 'Elaine';

SELECT * FROM USERS;

create index on users(emails);

select * from users where emails CONTAINS 'lilian@gmail.com';

LISTS

LET'S ADD A LIST COLUMN TO TABLE USERS:

ALTER TABLE users ADD top_movies list<text>;

LET'S ADD SOME DATA TO EXISTING ENTRIES:

update users set top_movies = ['titanic','gatsby'] where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Select * from users;

update users set top_movies = ['romeo and juliet'] + top_movies where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Select * from users;

update users set top_movies = top_movies + ['the revenant'] where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Let's replace position 1, titanic, with blood diamonds:

update users set top_movies[1] = 'blood diamonds' where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Select * from users;

Let's delete entry 1 from the list. // blood diamonds

delete top_movies[1] from users where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

select name, top_movies from users where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // we select only the collection

You can index a collection, let's query on a movie name:

create index top_movies on users(top_movies);

select * from users where top_movies contains 'gatsby';

MAPS
LET'S ADD A MAP COLUMN TO TABLE USERS:

ALTER TABLE users ADD movies map<timestamp, text>;

LET'S INSERT DATA:

update users set movies = {'2016-06-05':'rated movie','2016-06-06':'screened trailer'} where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Select movies from users;

You can use a (frozen) collection as part of a primary key.

STOP here for theory update.

5. Copy data from/to a CSV file

Let's start working on some real data.

5.1. Copy from a CSV file

Drop table users;

Let's re-create table users with the below syntax and import data from a CSV file.

CREATE TABLE users(
    uid int,
    age int,
    gender text,
    occupation int,
    zipcode text,
    PRIMARY KEY (uid));

WE HAVE IN THE DATA DIRECTORY (CLOUD OR LOCAL INSTALLATIONS) THE CSV FILES THAT WE WILL BE WORKING WITH. WE NEED Users.csv.

copy users(uid,gender,age,occupation,zipcode) from '/data/users.csv' with header = true;

select * from users limit 10;

5.2. Copy to a CSV file

COPY users(uid,gender,age) TO '/data/yourname.csv' WITH HEADER=TRUE;

6.1 Collections as Primary key

Did you know we can have a collection as part of a Primary Key? So if we want to know movies by genres - we could define genres as part of the Primary key.

CREATE TABLE movies_by_genres (
    movieid int,
    title text,
    genres frozen<set<text>>,
    year int,
    PRIMARY KEY (genres, movieid));

copy movies_by_genres(movieid,title,genres,year) from '/data/movies.csv' with header = true;

Select * from movies_by_genres limit 10;

6.2. Let's also do an exercise on User Defined Types

CREATE TYPE address (
    street text,
    number int,
    flat text);

CREATE TABLE users_udt (
    uid int,
    address frozen<address>,
    PRIMARY KEY (uid));

// When using the frozen keyword, you cannot update parts of a user-defined type value. The entire value must be overwritten. Cassandra treats the value of a frozen, user-defined type like a blob.

INSERT INTO users_udt(uid,address) VALUES (1, {street : 'Pipera', number : 34, flat : '34C'});

Select * from users_udt;

// Let's see if we can select only parts of the UDT
select address.street from users_udt where uid = 1;

// Let's see if we can update parts of the UDT

update users_udt set address.street = 'Vitan' where uid = 1; // will not work

Important: From 3.6: User-defined types may now be stored in a non-frozen form, allowing individual fields to be updated and deleted in UPDATE statements and DELETE statements, respectively (CASSANDRA-7423). Let's test.

CREATE TABLE users_udt_1 (
    uid int,
    address address,
    PRIMARY KEY (uid));

INSERT INTO users_udt_1(uid,address) VALUES (1, {street : 'Pipera', number : 34, flat : '34C'});

Select * from users_udt_1;

update users_udt_1 set address.street = 'Vitan' where uid = 1;

Select * from users_udt_1;
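The frozen vs non-frozen contrast can be sketched as a toy model (an analogy, not how Cassandra stores bytes): a frozen value behaves like one immutable blob that can only be replaced whole, while a non-frozen value keeps individually writable fields.

```python
# Toy contrast: a frozen UDT behaves like one opaque value (replace it whole);
# a non-frozen UDT keeps per-field granularity.
frozen_row = {"address": ("Pipera", 34, "34C")}      # tuple: immutable as a whole
nonfrozen_row = {"address": {"street": "Pipera", "number": 34, "flat": "34C"}}

# "set address.street = 'Vitan'" has no per-field meaning for the frozen form;
# the only legal write replaces the entire value.
frozen_row["address"] = ("Vitan", 34, "34C")

# The non-frozen form supports the field-level update directly.
nonfrozen_row["address"]["street"] = "Vitan"

assert frozen_row["address"][0] == "Vitan"
assert nonfrozen_row["address"]["street"] == "Vitan"
```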


7. Let's do some data deletes:

SELECT * from users limit 10;

Delete the value of a cell:

select * from users where uid = 4317;

delete age from users where uid = 4317;

select * from users where uid = 4317;

Delete a partition:

delete from users where uid = 4317;

Before we move on

// see table statistics

nodetool cfstats metro_systems.users

// see number of keys = number of partitions.

nodetool cfhistograms metro_systems users
or
nodetool tablehistograms metro_systems users

Select token(uid) from users where uid = 1; // we want to see which token is allocated for this partition
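The token you see is the partitioner hashing the partition key into a signed 64-bit token that decides which node owns the partition. A toy stand-in using md5 (illustration only - Cassandra's default Murmur3Partitioner uses a different hash, so the values will not match real tokens):

```python
import hashlib

# Toy partitioner: hash the partition key to a signed 64-bit "token".
# (Illustration only - Cassandra's Murmur3Partitioner produces different values.)
def toy_token(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

t1 = toy_token("1")
assert t1 == toy_token("1")        # the same key always maps to the same token
assert -(2**63) <= t1 < 2**63      # fits the signed 64-bit token range
assert toy_token("2") != t1        # different keys land on different tokens
```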

Basic data model -- update theory

Exercise 8: Time series: weather stations

CREATE TABLE temperature (
    w_id text,
    event_time timestamp,
    temperature int,
    PRIMARY KEY (w_id, event_time)
) with clustering order by (event_time DESC);

Let's import data in the temperature table:

copy temperature(w_id,event_time,temperature) from '/data/weatherdata.csv' with header = true and maxrows = 100000;

Select * from temperature limit 30;

Let's look for the data (time & temp) on a single weather station; what would the query look like?

select event_time, temperature from temperature where w_id = 'PNOSCPQCNX';

What about all the temperatures between 2016-01-10 03:49:46 and 2016-01-11 16:29:59?

select temperature from temperature where w_id = 'PNOSCPQCNX' AND
event_time > '2016-01-10 03:49:46' AND event_time < '2016-01-11 16:29:59';

Average temperature between 2016-01-10 03:49:46 and 2016-01-11 16:29:59?

select avg(temperature) from temperature where w_id = 'PNOSCPQCNX' AND
event_time > '2016-01-10 03:49:46' AND event_time < '2016-01-11 16:29:59';

Create table temperature_by_day (a solution for wide rows)

CREATE TABLE temperature_by_day (
    w_id text,
    date date,
    event_time timestamp,
    temperature int,
    PRIMARY KEY ((w_id, date), event_time)
);

Let's import data in the new table:

copy temperature_by_day(w_id,event_time,date,temperature) from '/data/weatherdata-ext.csv' with header = true and maxrows = 100000;

Query the weather data for a single day and then for 2 consecutive days:
SELECT * FROM temperature_by_day WHERE w_id='FFOERIRWOL' AND date='2015-03-19';
SELECT * FROM temperature_by_day WHERE w_id='FFOERIRWOL' AND date in ('2015-03-19','2015-03-18');
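The point of the ((w_id, date), event_time) key is day-bucketing: a station's otherwise unbounded time series is split into one bounded partition per day, and "one day" becomes exactly one partition. A toy sketch of deriving the bucket:

```python
from datetime import datetime

# Toy sketch of day-bucketing: the partition key becomes (w_id, date), so each
# station/day pair is a separate, bounded partition; event_time stays the
# clustering key inside the bucket.
def bucket(w_id: str, event_time: str):
    day = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S").date().isoformat()
    return (w_id, day)

assert bucket("FFOERIRWOL", "2015-03-19 03:49:46") == ("FFOERIRWOL", "2015-03-19")
# Two consecutive days are two different partitions - hence the date IN (...) query.
assert bucket("FFOERIRWOL", "2015-03-18 23:59:59") != bucket("FFOERIRWOL", "2015-03-19 00:00:00")
```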

Let's see how to query using a non primary key column:

This query will not work - because we are trying to query without the partition key:

select * from temperature_by_day where temperature = 29;

In order for it to work, either we allow filtering of the data:

select * from temperature_by_day where temperature = 29 allow filtering; // use this with caution, it brings all the table data locally to one coordinator.

Or we create an index on temperature:

Create index on temperature_by_day(temperature);

select * from temperature_by_day where temperature = 29;

Deleting old entries - WE CAN SET EXPIRING DATA AT TABLE LEVEL OR PER ENTRY

EXPIRE DATA PER TABLE

CREATE TABLE latest_temperatures (
    w_id text,
    event_time timestamp,
    temperature int,
    PRIMARY KEY (w_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC) AND default_time_to_live = 10;

copy latest_temperatures(w_id,event_time,temperature) from '/data/weatherdata.csv' with header = true and maxrows = 10000;

Select * from latest_temperatures;

Repeat the select after 10 seconds.

EXPIRE DATA PER ENTRY

INSERT INTO latest_temperatures(w_id,event_time,temperature) VALUES
('12345', '2015-09-17 21:29:35', 20) USING TTL 20;

Select * from latest_temperatures; // in 20 s
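The TTL semantics you just observed can be sketched as a toy model (an illustration of the behavior, not Cassandra's tombstone machinery): each cell carries an expiry time, and reads simply skip cells whose expiry has passed.

```python
import time

# Toy TTL semantics: each cell stores an optional expiry time; reads skip
# expired cells, which is why the second select after the TTL sees nothing.
rows = {}

def insert(key, value, ttl=None, now=None):
    now = time.time() if now is None else now
    rows[key] = (value, None if ttl is None else now + ttl)

def select(key, now=None):
    now = time.time() if now is None else now
    value, expires = rows.get(key, (None, None))
    if expires is not None and now >= expires:
        return None          # expired: invisible to reads
    return value

insert("12345", 20, ttl=20, now=1000.0)
assert select("12345", now=1010.0) == 20     # still within its 20 s TTL
assert select("12345", now=1025.0) is None   # a read after expiry sees nothing
```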
